1. INTRODUCTION

In this annual report we present our work performed during the
period April 6, 1978 to June 7, 1979 in the area of speech
compression and synthesis.


In  Section  1.1  we  give  a  very  brief  list  of  the  major 
accomplishments  in  the  past  year.  The  reader  is  referred  to  the 
body  of  the  report  for  details  on  these  as  well  as  other 
accomplishments.  An  outline  of  this  report  is  given  in  Section 
1.2.  In  Section  1.3,  we  give  a  list  of  the  presentations  and 
publications  for  the  past  year.  The  publications  are  included  in 
the Appendix.


1.1  Major  Accomplishments 

a) Developed RTFUD, an RT-11/FORTRAN debugging environment for the
AP-120B  vocoder  program. 

b)  Brought  up  the  ISI  LPC-II  real-time  vocoder  on  our  system. 
Modified the vocoder to include the optimum-linear-fit
variable-frame-rate (VFR) algorithm and the mixed-source model,
both  developed  at  BBN. 

c)  Implemented  and  tested  a  range  of  adaptive  autocorrelation  and 
lattice  algorithms  for  LPC  analysis,  which  have  implications 
for  new  hardware  developments  (such  as  switched-capacitor 
designs).

d)  Developed  a  new  high-frequency  regeneration  (HFR)  technique, 
spectral  duplication,  to  be  used  in  baseband  coders  in  the 
range  above  5000  bits/s. 

e)  Designed  and  implemented  a  program  for  phonetic  synthesis  from 
diphone  templates.  The  program  was  tested  successfully  using 
an  initial  set  of  200  diphones.  Designed  and  recorded  a  data 
base comprising a complete set of diphones (about 3000) for
English.  Began  the  lengthy  process  of  extracting  the  diphone 
templates;  we  have  completed  the  extraction  of  1500  diphones. 


1.2  Outline 


In  Section  2  we  present  our  work  towards  bringing  up  a 
real-time  vocoder  and  improving  its  output  speech  quality.  Section 
3  describes  the  results  of  new  techniques  for  extraction  and  coding 
of  linear  prediction  (LPC)  parameters.  A  new  high-frequency 
regeneration  (HFR)  technique  is  presented  in  Section  4  as  a  source 
model  to  be  used  with  baseband  coders.  Finally,  in  Section  5,  we 
present  our  progress  in  developing  the  phonetic  synthesis  component 
of  a  very  low  rate  (VLR)  coder. 

1.3  Presentations  and  Publications 

During  the  past  year,  we  gave  a  number  of  oral  presentations 
at  the  regular  ARPA  Network  Speech  Compression  (NSC)  Meetings.  In 
addition,  we  made  four  presentations  at  conferences  and  had  two 
papers  published.  These  were: 

a) J. Makhoul, "A Class of All-Zero Lattice Digital Filters:
Properties and Applications," IEEE Trans. Acoustics, Speech and
Signal Processing, pp. 304-314, Aug. 1978.

b) J. Makhoul, R. Viswanathan, R. Schwartz, and A.W.F. Huggins, "A
Mixed-Source Model for Speech Compression and Synthesis," J.
Acoust. Soc. Am., Vol. 64, No. 6, pp. 1577-1581, Dec. 1978.

c) J. Makhoul and M. Beyrouti, "Predictive and Residual Encoding
of Speech," invited paper, J. Acoust. Soc. Am., Vol. 64,
Supplement No. 1, paper YY2, p. S128, Fall 1978.

d) A.W.F. Huggins, R.M. Schwartz, R. Viswanathan, and J. Makhoul,
"Subjective Quality Testing of a New Source Model of LPC
Vocoders," J. Acoust. Soc. Am., Vol. 64, Supplement No. 1, paper
GGG12, p. S161, Fall 1978.

e) J. Makhoul and M. Beyrouti, "High-Frequency Regeneration in
Speech Coding Systems," IEEE Int. Conf. Acoustics, Speech, and
Signal Processing, Washington, DC, pp. 428-431, April 1979.

f) R. Schwartz, J. Klovstad, J. Makhoul, D. Klatt, and V. Zue,
"Diphone Synthesis for Phonetic Vocoding," IEEE Int. Conf.
Acoustics, Speech, and Signal Processing, Washington, DC, pp.
891-894, April 1979.

Copies of papers a), b), e), and f) are given in the Appendix.

2. REAL TIME VOCODER

Much  of  our  effort  of  the  past  year  has  been  devoted  to 
bringing  up,  modifying,  and  evaluating  a  real-time  ARPANET  vocoder. 
The  original  vocoder  had  been  implemented  at  the  Information 
Sciences  Institute  (ISI),  using  an  FPS-120B  Array  Processor  to 
perform  the  numerical  calculations  and  a  PDP-11  to  transmit  and 
receive  coded  speech  across  the  network.  The  vocoder  required  a 
special  operating  system,  EPOS,  also  designed  and  implemented  at 
ISI,  to  handle  the  network  transactions. 

2.1 Installation

Installing  the  vocoder  at  BBN  required  several  steps.  We 
first  had  to  obtain  the  necessary  hardware  and  software  and  then 
configure  both  to  our  particular  needs. 

2.1.1  Operating  System  Installation 

The  process  of  installing  EPOS  on  our  PDP-11  was  complicated 
by  the  fact  that  our  PDP-11  was  quite  different  from  the  one  used 
at  ISI.  They  used  an  11/45,  while  we  have  an  11/40.  The  major 
differences  are  in  the  effective  address  spaces  available  on  the 
two  machines.  ISI  streamlined  their  code  as  much  as  possible  to 
fit  into  our  address  space,  and  both  sites  spent  a  good  deal  of 
time  debugging  the  BBN  configuration  of  EPOS. 

2.1.2 Hardware Installation

In  the  process  of  debugging  EPOS,  we  discovered  that  we  did 
not  have  enough  physical  memory  on  our  PDP-11.  We  purchased  and 
installed an additional 64K words of MOS, bringing the total
memory  on  the  PDP-11  to  128K. 

In  January  1978,  Floating  Point  Systems,  Inc.  installed  an 
AP-120B  array  processor  on  our  PDP-11.  We  installed  the  AP  program 
development software and library under RT-11, and tested and
accepted  the  system. 

2.1.3  Program  Development  Software  Installation 

The  program  development  software  required  to  install  the 
vocoder  program  includes  the  AP  assembler,  APAL,  and  the  AP  linker, 
APLINK.  An  unanticipated  difficulty  arose  over  this  software.  ISI 
performs  all  assembling  and  linking  on  their  TENEX  system,  and  our 
PDP-11  system  would  not  accept  the  formats  of  the  source  files  used 
by  ISI.  In  addition,  they  had  modified  some  of  the  FPS  support 
software,  which  also  ran  under  TENEX.  We  were  forced  to  modify  all 
of  this  support  software  to  run  under  the  TOPS-20  operating  system. 

2.1.4  Vocoder  Program  Installation 

The successful modification of the support software enabled us
to make several changes to the vocoder program, which were required
to accommodate differences between the ISI and BBN systems. The most
significant difference is that ISI uses A/D and D/A converters
interfaced directly to the AP-120B Array Processor, while BBN's
converters are interfaced to the SPS-41. ISI had initially used the
SPS-41 for analog/digital conversion, but for later versions of the
vocoder, they switched to the AP-120B. Because of this switch, we
had to identify and change the various routines that use the
converters. EPOS modifications were also necessary to allow access
to the SPS-41.

2.1.5  Initial  Vocoder  Experiments 

Using  FUD,  an  ISI-developed  debugger  for  the  AP-120B,  we 
tested  the  AP-120B  coder  for  the  vocoder.  This  test  did  not  use 
the  ARPA  Network,  but  merely  fed  the  parameters  obtained  from  the 
analysis  portion  of  the  AP-120B  program  back  to  the  synthesis 
portion  of  the  program.  We  obtained  moderately  intelligible  output 
speech  in  real-time.  We  were  also  able  to  store  analysis 
parameters  on  a  PDP-11  data  file  and  subsequently  transmit  that 
file  to  ISI,  who  performed  the  synthesis.  Using  this  testing 
procedure,  we  found  and  fixed  bugs  in  the  converter  routines.  We 
were  able  to  obtain  good  quality  speech  using  the  vocoder  in  this 
"back-to-back"  manner.  We  further  tested  the  vocoder  system  by 
carrying  on  a  conversation  with  ISI  over  the  ARPA  Network,  and  we 
also  used  the  "echo"  facility  of  the  vocoder  program  to  listen  to 
ourselves over the network. This testing procedure pointed out a
potentially serious problem. The PDP-11/40 seemed unable to keep
up with the real-time speech requirements. Because ISI uses a
PDP-11/45, a significantly faster machine, they had not encountered
this  problem.  We  were  able  to  change  the  priorities  of  certain 
tasks  in  the  PDP-11  portion  of  the  vocoder,  and  to  speed  up  the 
coder  somewhat,  so  that  we  no  longer  have  timing  problems  in 
point-to-point  conversations.  However,  there  are  still  real-time 
problems  that  surface  during  conferencing. 

The  network  experiments  also  demonstrated  that  the  speech 
quality  was  considerably  worse  over  the  network  than  that  obtained 
from  the  "back-to-back"  system.  The  problem  was  determined  to  be 
in  BBN's  receiver.  With  the  help  of  ISI,  we  were  able  to  find  and 
fix  a  subtle  bug  in  the  real-time  buffering  code.  This  bug 
resulted  from  the  configuration  differences. 

We  also  participated  in  several  conferencing  experiments  with 
Lincoln  Laboratories  and  ISI.  Unfortunately,  due  to  the  lack  of 
speed  in  the  11/40  mentioned  above,  we  were  unable  to  communicate 
with  both  of  the  other  participants.  We  are  currently  working  with 
ISI  to  find  modifications  that  will  support  conferencing  on  our 
configuration.

2.2 Vocoder Modifications

Once  we  had  established  the  "current"  ISI  version  of  the 
vocoder  at  BBN,  we  were  able  to  modify  it  to  include  some  of  the 
algorithmic  recommendations  we  had  made  previously.  At  the  same 
time,  we  attempted  to  make  the  vocoder  more  modular,  and  to  keep 
our  modifications  as  modular  as  possible. 

2.2.1  Support  Software 

The  difficulty  we  experienced  in  debugging  the  initial  vocoder 
convinced  us  that  more  effective  debugging  and  testing  tools  were 
necessary.  We  also  found  that,  while  it  was  fine  for  real-time 
operation,  EPOS  had  serious  drawbacks  as  a  debugging  environment. 
The  most  effective  debugging  could  be  done  if  the  vocoder  were 
running  in  the  Fortran  environment  under  RT-11. 

2.2.1.1 APEX and APLINK

The  most  general  system  for  using  the  AP-120B  for  array 
processing  is  the  AP  Executive  (APEX)  supplied  by  Floating  Point 
Systems  Inc.  APEX  allows  a  host  (PDP-11)  FORTRAN  program  to 
transfer  data  to  and  from  the  AP-120B  and  to  specify  computational 
operations  to  be  performed  on  the  data  by  the  AP-120B.  The  APEX 
library  (APLIB)  includes  an  extensive  repertoire  of  elementary 
computational  operations,  and  the  user  may  program  his  own 
functions,  to  be  used  in  the  same  way. 

The APEX system, as delivered by FPS, is written in FORTRAN,
which on our PDP-11 operating system (RT-11) produces inefficient
code. As a result, each APEX routine called from the host FORTRAN
program consumes 2 to 4 ms of PDP-11 time just to start the
computation process in the AP-120B. Many of these elemental
operations require on the order of 1 ms to run in the AP-120B, so
most of the time for running a multicall APEX program is spent in
PDP-11 APEX overhead.

Analysis  of  the  operations  of  APEX  in  starting  an  AP-120B 
computation  showed  that  efficiency  could  be  gained  from  (1) 
recoding  the  time-critical  sections  of  APEX  in  machine  language  to 
avoid  inefficient  FORTRAN  code  and  to  eliminate  the  time-consuming 
nesting  of  FORTRAN  subroutine  calls  and  (2)  streamlining  the  manner 
in  which  APEX  is  called  by  each  individual  user-called  routine, 
once  again  eliminating  needless  nesting  of  FORTRAN  subroutine 
calls.  The  latter  was  accomplished  by  modifying  APLINK,  one  of  the 
AP-120B  program  development  programs.  The  resulting  modified 
versions  of  these  programs,  APEX2  and  APLINK2,  were  then  tested, 
and  the  overhead  associated  with  an  APEX  call  was  found  to  have 
been  reduced  to  0.5  to  1.0  ms. 
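
The effect of this reduction can be illustrated with a little arithmetic. The sketch below assumes that a typical elemental operation takes about 1 ms in the AP-120B (the order of magnitude quoted above) and that the host overhead and the AP-120B computation do not overlap; the function name is illustrative only.

```python
def overhead_fraction(overhead_ms, compute_ms=1.0):
    """Fraction of one APEX call's time spent in host (PDP-11) overhead.

    Assumes host overhead and AP-120B computation run one after the other.
    """
    return overhead_ms / (overhead_ms + compute_ms)

# Original APEX: 2 to 4 ms of overhead per call.
before = [overhead_fraction(ms) for ms in (2.0, 4.0)]
# Modified APEX2/APLINK2: 0.5 to 1.0 ms of overhead per call.
after = [overhead_fraction(ms) for ms in (0.5, 1.0)]

print("before: %.0f%% to %.0f%%" % (before[0] * 100, before[1] * 100))
print("after:  %.0f%% to %.0f%%" % (after[0] * 100, after[1] * 100))
```

Under these assumptions the overhead share of a 1 ms operation falls from roughly two-thirds or more of the call time to one-half or less.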

2.2.1.2 SPS-FPS Support

We  developed  and  tested  a  package  of  PDP-11  subroutines  that 
control  transfers  between  the  SPS-41  converters  and  the  AP-120B 
Main  Data  memory.  These  subroutines  automatically  load  SPS-41 
program  memory,  initialize  the  sampling  rate  counters,  and  manage 
the  double  buffering  required  for  real-time  operation.  They  also 
return  input  and  output  data  magnitude  information  to  the  calling 
program  and  indicate  whether  the  real-time  constraints  have  been 
met,  and,  if  not,  how  much  additional  time  was  required.  These 
subroutines  have  been  designed  to  operate  under  APEX,  the 
FPS-supplied  executive. 

2.2.1.3 IMSYS

Graphic  displays  are  an  invaluable  aid  to  signal  processing 
research.  Since  1971,  we  have  been  using  an  IMLAC  PDS-1  display 
computer  in  conjunction  with  our  PDP-10  systems  for  interactive 
graphics  support  of  our  speech  and  signal  processing  research.  A 
major  part  of  the  support  software  for  the  use  of  this  graphics 
system  has  been  IMSYS,  a  general-purpose  graphics  control  program 
for  the  IMLAC,  which  creates  and  maintains  displays  specified  by 
user  programs  operating  in  the  host  PDP-10.  IMSYS  graphics 
packages  that  are  usable  by  user  programs  written  in  INTERLISP, 
FORTRAN,  and  BCPL  have  put  interactive  graphics  at  the  disposal  of 
a  wide  variety  of  user  programs. 

In order to use this graphics facility in our testing
procedures,  we  extended  the  usability  of  IMSYS  graphics  to  FORTRAN 
programs  running  on  our  Speech  Processing  PDP-11/40.  The  IMLAC 
part  of  IMSYS  is  used  without  modification.  The  host  IMSYS 
graphics  package  was  taken  largely  intact  from  another  BBN  project 
that had implemented IMSYS in RSX-11D FORTRAN; only new
host-to-IMLAC data exchange routines and minor housekeeping changes
were  necessary. 

2.2.1.4 RTFUD

With  the  preliminary  support  software  developed,  we  were  able 
to implement RTFUD, an RT-11/FORTRAN debugging environment for the
AP-120B vocoder program. RTFUD operates the vocoder in
non-real-time, using files for input and output of digitized speech
waveforms and/or transmission parameters. Non-real-time operation
implies  that  the  vocoder  can  be  stopped,  examined,  and  then 
continued.  Furthermore,  the  implementation  in  FORTRAN  has 
permitted  the  rapid  implementation  of  a  number  of 
statistics-gathering  and  diagnostic  tools.  RTFUD  is  used  with  AP 
FDT,  a  FORTRAN  debugging  program  (FDT)  that  we  have  modified  to 
permit  it  to  debug  the  AP-120B  as  well  as  the  PDP-11  program.  The 
facilities  of  RTFUD  have  uncovered  several  bugs  and  problems  in  the 
AP-120B  vocoder  program,  and  they  have  been  used  for  developing  the 
variable  frame  rate  (VFR)  and  synthesizer  modifications  described 
below. This section gives a brief description of the facilities
offered  by  the  current  implementation  of  RTFUD. 

Figure  2.1  illustrates  the  relation  of  the  real-time  AP-120B 
program  to  the  RTFUD  implementation.  In  the  real-time 
implementation,  the  AP-120B  vocoder  program  is  organized  as  a 
closed  loop.  In  RTFUD,  the  analysis  (XANAL)  and  synthesis  (XSYNTH) 
portions  of  the  program  are  separately  and  explicitly  controlled  by 
the  RTFUD  program  (by  means  of  APEX,  the  FPS-supplied  "AP-120B 
Executive")  in  the  PDP-11.  The  analysis  and  synthesis  portions  of 
the  vocoder  programs  are  effectively  identical  in  both  cases. 
Therefore,  developments  made  to  the  AP-120B  programs  in  the 
non-real-time configuration are directly transferable to the
real-time  system;  RTFUD  does  not  compromise  the  real-time  goals  of 
the  vocoder  development  project. 

The  basic  mode  of  operation  of  the  RTFUD  vocoder  is 
"back-to-back",  with  digitized  speech  input  from  a  file  processed 
by  the  analysis  and  synthesis  portions  of  the  vocoder,  producing 
synthesized  speech  output  to  a  file.  RTFUD  also  provides  a 
real-time  playback  facility,  which  converts  a  disk  file  to  speech 
for  listening  purposes.  RTFUD  also  provides  for  "analysis"  and 
"synthesis"  modes  of  operations,  in  which  the  respective  output  or 
input  is  a  file  containing  transmission  data  (which  in  an  actual 
real-time  system  would  be  transmitted  through  the  communications 
network).

Fig. 2.1 Comparison of the AP-120B vocoder program in the real-time
system and in the RTFUD system. (Panels: Real-time system; RTFUD
system.)

Figure  2.2  illustrates  the  value  of  being  able  to  use  a  known, 
repeatable  input  to  the  vocoder  and  being  able  to  observe  the  input 
and  output  waveforms  in  detail.  The  figure  shows  three  versions  of 
the  waveform  of  the  syllable  "mass"  from  the  word  "Massachusetts". 
The  top  trace  is  the  original  digitized  speech;  the  center  trace 
shows the same syllable at the output of the ISI vocoder. A
program bug has inappropriately lowered the gain about half-way
through the syllable. By editing the input file to contain just
this syllable, and observing certain vocoder memory locations after
each frame of speech, it was an easy task to localize and correct
this bug in the synthesis interpolation routine. The output speech
after this bug was fixed is shown in the bottom trace of Figure
2.2.

Report No. 4159
Bolt Beranek and Newman Inc.

Fig. 2.2 Waveforms of the syllable "mass" in "Massachusetts": (top)
original; (center) vocoded with gain bug; (bottom) vocoded
with gain bug fixed.


RTFUD  accumulates  several  statistics  of  the  vocoder,  which  can 
be  valuable  in  understanding  and  adjusting  the  operation  of  the 
vocoder.  Figure  2.3  illustrates  the  type  of  statistics  currently 
printed.


>BACK-TO-BACK

INPUT SPEECH FILE: DK1:SC2A.WAV
54501. SAMPLES @ 150 USEC = 8.175 SEC
OUTPUT SPEECH FILE: DK1:SC2AB.WAV
872 FRAMES
PDP-11 TIME = 35.9 MS/FRAME
AP-120B TIME (ANALYSIS) = 3.4 MS/FRAME
AP-120B TIME (SYNTHESIS) = 2.0 MS/FRAME
ANALYSIS: 27.8 BITS/FRAME, 2895. BITS/SEC
SYNTHESIS: 27.8 BITS/FRAME, 2895. BITS/SEC
PITCH, GAIN, K'S: 44.4 51.2 57.2 FRAMES/SEC (OUT OF 104.2)


Fig.  2.3  Statistics  summary  output  by  RTFUD. 
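
The relationships among the quantities in this summary can be checked with simple arithmetic. The sketch below is only an illustration: the 150 usec sample period and the 27.8 bits/frame figure come from the summary, while the frame length of 64 samples is an assumption inferred from the quoted maximum of 104.2 frames/sec.

```python
SAMPLE_PERIOD_S = 150e-6   # from the summary: 150 usec per sample
SAMPLES_PER_FRAME = 64     # assumption: inferred from 104.2 frames/sec

def summarize(num_samples, bits_per_frame):
    """Return (duration in sec, frame rate in frames/sec, bit rate in bits/sec)."""
    duration_s = num_samples * SAMPLE_PERIOD_S
    frame_rate = 1.0 / (SAMPLES_PER_FRAME * SAMPLE_PERIOD_S)
    bit_rate = bits_per_frame * frame_rate
    return duration_s, frame_rate, bit_rate

duration_s, frame_rate, bit_rate = summarize(54501, 27.8)
print("%.3f sec, %.1f frames/sec, %.0f bits/sec"
      % (duration_s, frame_rate, bit_rate))
```

The computed bit rate agrees with the printed 2895 bits/sec to within the rounding of the 27.8 bits/frame average.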


RTFUD  contains  provisions  for  accumulating  histogram  data  on 
the  transmitted  parameters.  Figure  2.4  shows  a  histogram  display 
derived from vocoding about 10 sec of speech from one (male)
speaker. (The histogram display was made with IMSYS, the graphics
system implemented on our PDP-11.) Several valuable observations
about the vocoder operation may be made from this figure. Note
that the pitch period histogram is distinctly bimodal. The
right-hand lump represents correct pitch periods for this speaker,
and the left-hand lump represents octave errors (pitch periods of
half the correct period). Also note the K1 (first reflection
coefficient) histogram. It shows that the highest coded value of
K1 is used a disproportionate amount of the time. In other words,
the quantization tables appear not to encompass the full range of
K1 computed by the analysis portion of the vocoder. Figure 2.5
shows a comparable histogram for another (female) speaker. The
pitch period histogram shows shorter pitch periods, as expected,
and few (if any) octave errors. The K1 histogram shows the same
effect as before, and for this speaker, K2 and K4 are also severely
skewed.

Fig. 2.4 Histogram display of transmitted parameters, single
(male) speaker.

Fig. 2.5 Histogram display of transmitted parameters, single
(female) speaker. (Panels include PITCH, GAIN, K1, K2, K7, K8, K9.)
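
The kind of check that these histograms support can be sketched in a few lines. The example below is hypothetical: the 32-level quantizer, the made-up K1 codes, and the "three times the uniform share" criterion are all assumptions chosen for illustration, not values taken from the vocoder.

```python
from collections import Counter

def top_code_fraction(codes, num_levels):
    """Fraction of frames coded with the highest quantizer level."""
    counts = Counter(codes)
    return counts[num_levels - 1] / len(codes)

def is_saturated(codes, num_levels, factor=3.0):
    """Flag a parameter whose top level is used far more than a uniform share."""
    return top_code_fraction(codes, num_levels) > factor / num_levels

# Made-up coded K1 values for a hypothetical 32-level quantizer.
k1_codes = [31] * 40 + [30] * 10 + [29] * 10 + list(range(10, 29)) * 2
print(top_code_fraction(k1_codes, 32), is_saturated(k1_codes, 32))
```

A parameter flagged in this way suggests that the quantization table should be extended to cover the range actually produced by the analysis.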

RTFUD contains provisions for producing frame-by-frame
listings  of  various  vocoder  parameters  such  as  analysis  parcels 
(before  VFR  decisions),  synthesis  parcels  (after  VFR  decisions), 
and  unquantized  reflection  coefficients.  Because  the  FORTRAN 
environment  of  RTFUD  facilitates  such  formatted  output,  during  the 
testing  of  the  new  VFR  algorithm  it  was  a  simple  matter  to  add  a 
special  printout  of  the  VFR  tables  so  that  the  operation  of  the  VFR 
algorithm  could  be  monitored  and  verified. 

RTFUD  also  contains  commands  for  controlling  such  things  as 
enabling  or  disabling  VFR  operation  of  the  vocoder,  controlling 
double-threshold  pitch  and  gain  transmission  (not  previously 
implemented), and adjusting the VFR thresholds for the desired
transmission  rate. 

We  have  made  RTFUD  available  to  the  other  Network  Speech 
Compression  sites.  Currently,  ISI  is  in  the  process  of  bringing  it 
up  on  their  system,  under  RT-11  version  3.  They  are  experiencing 
some  compatibility  problems  between  version  3  and  version  2,  under 
which  we  developed  RTFUD. 

2.2.2  Lattice  Synthesis  Modification 

We  modified  the  vocoder  to  use  a  lattice-form  synthesis  filter 
instead  of  the  normalized  form  that  was  previously  used.  The 
normalized-form filter was advantageous in the SPS-41 vocoder
implementation, but not in the AP-120B. The lattice form requires
only two multiplies per stage, compared to four for the
normalized form. We used the computation time saved by this
reduction  to  make  the  synthesis  routine  more  modular.  Where 
before,  the  arguments  of  the  routine  were  hand  compiled  and  were  an 
integral  part  of  the  design  of  the  vocoder,  they  are  now  passed  to 
the  subroutine  and  can  be  general. 

The normalized-form filter required the arc-sine of the
reflection coefficients, used as pointers to the filter
coefficients. Decoding of transmitted parameters and linear
interpolation were performed in this arc-sine domain. The
lattice-form filter uses the reflection coefficients themselves.
However, since the new VFR algorithm assumes linear interpolation
in the log-area-ratio (LAR) domain (for reasons of spectral
sensitivity), we modified the vocoder synthesizer to decode and
interpolate in this domain. We then added a subroutine to convert
from LARs to reflection coefficients for the synthesis filter
computations.
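
The synthesis path just described (interpolate in the LAR domain, convert to reflection coefficients, run the lattice filter) can be sketched as follows. This is a minimal illustration and not the AP-120B code: the LAR sign convention g = log((1 + k)/(1 - k)) is an assumption (the opposite sign is also in common use), and all function names are hypothetical.

```python
import math

def lar_to_reflection(lar):
    """Invert g = log((1 + k) / (1 - k)); this sign convention is assumed."""
    return math.tanh(lar / 2.0)

def interpolate_lars(lars_a, lars_b, weight):
    """Linear interpolation between two LAR frames, weight in [0, 1]."""
    return [a + weight * (b - a) for a, b in zip(lars_a, lars_b)]

def lattice_synthesize(excitation, k):
    """All-pole lattice synthesis filter, two multiplies per stage."""
    order = len(k)
    b = [0.0] * (order + 1)      # b[m] holds the backward residual b_m(n-1)
    output = []
    for sample in excitation:
        f = sample               # forward residual entering the top stage
        for m in range(order, 0, -1):
            f = f - k[m - 1] * b[m - 1]       # f_{m-1}(n)
            b[m] = k[m - 1] * f + b[m - 1]    # b_m(n)
        b[0] = f                 # b_0(n) equals the output sample
        output.append(f)
    return output

# Frame halfway between two transmitted frames (made-up LAR values).
lars = interpolate_lars([0.0, -1.0], [2.0, 1.0], 0.5)
k = [lar_to_reflection(g) for g in lars]
impulse_response = lattice_synthesize([1.0] + [0.0] * 7, k)
```

Each stage of the inner loop uses exactly two multiplications, which is the saving over the four-multiply normalized form noted above.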

2.2.3  New  Variable  Frame-Rate  Modification 

We  implemented  and  tested  the  new  VFR  algorithm  described  in 
[1].  We  have  successfully  modified  the  vocoder  to  use  this  new 
algorithm. 

The new VFR algorithm operates by modeling what would occur in
the  synthesis  process  if  the  current  packet  were  transmitted.  If 
the  average  difference  between  the  actual  LAR  values  since  the  last 
transmitted  frame  and  the  interpolated  values  is  less  than  a 
threshold  (implying  that  linear  interpolation  between  the  end 
points  is  satisfactory),  no  transmission  occurs.  If  the  average 
distance  is  greater  than  the  threshold,  implying  that  the  current 
frame  is  not  part  of  the  previous  trend,  the  previous  frame  is 
transmitted.  Thus,  the  new  algorithm  finds  suitable  end  points  for 
line segments that reasonably approximate the computed LARs. These
end  points  are  then  transmitted. 
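
The transmission decision described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the vocoder implementation: the error measure (mean absolute LAR difference over all intermediate frames and coefficients), the handling of the first and last frames, and the function name are all hypothetical choices.

```python
def vfr_select(lars, threshold):
    """Return the indices of the frames that would be transmitted.

    lars is a list of frames, each a list of LAR values. A frame is sent
    when a straight line from the last sent frame to the current frame no
    longer fits the frames in between.
    """
    n = len(lars)
    if n == 0:
        return []
    sent = [0]                  # assumption: the first frame is always sent
    anchor = 0
    for i in range(2, n):
        span = i - anchor
        if span < 2:
            continue            # no intermediate frames to test yet
        total, count = 0.0, 0
        for j in range(anchor + 1, i):
            w = (j - anchor) / span
            for a, b, x in zip(lars[anchor], lars[i], lars[j]):
                total += abs(x - (a + w * (b - a)))
                count += 1
        if total / count > threshold:
            sent.append(i - 1)  # the previous frame ends the line segment
            anchor = i - 1
    if sent[-1] != n - 1:
        sent.append(n - 1)      # assumption: close the final segment
    return sent

# One LAR per frame, rising and then falling, with a corner at frame 4.
track = [[0.0], [1.0], [2.0], [3.0], [4.0], [3.0], [2.0], [1.0], [0.0]]
print(vfr_select(track, threshold=0.1))
```

For this made-up track the selected frames are the two ends and the corner, which are exactly the end points of the two line segments.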

We evaluated the speech quality produced by the real-time
vocoder using the new Variable Frame Rate (VFR)
(optimal-linear-fit) algorithm. We found, using RTFUD, that the
vocoder  using  the  older  form  of  VFR  transmission  transmits 
continuous  speech  at  a  rate  of  about  2500  bits/second,  using  the 
nominal  value  for  the  VFR  threshold.  At  a  transfer  rate  this  high, 
we  hear  little,  if  any,  quality  improvement  with  the  new  VFR 
algorithm.  However,  for  transmission  rates  below  2000  bits/second, 
the  new  method  produces  speech  that  is  clearly  superior  to  that 
produced  by  the  old  method,  as  shown  in  Figure  2.6.  The  quality 
produced  by  the  old  method  degrades  rapidly  as  the  transmission 
rate  is  reduced  beyond  a  certain  point  (approximately  2000 
bits/second for continuous speech). The new method degrades more
slowly, and continues to degrade slowly even when the transmission
rate is reduced beyond this point. Since the transmission rate and
the quality of the speech depend on the specific utterance, the bit
rates discussed are approximate and should be understood to be only
indicative of the general trend shown in Figure 2.6.

Fig. 2.6 Speech Quality Comparison of Old VFR and New VFR
Algorithms. (Axis label: QUALITY.)

2.2.4 Mixed Source Analysis and Synthesis

During  the  past  year  we  have  added  mixed-source  analysis  and 
synthesis  to  the  real-time  vocoder. 

2.2.4.1 Motivation

The  commonly  used  source  or  excitation  for  the  synthesizer  of 
narrowband LPC vocoders is the result of an idealized model, which
is either a sequence of pulses separated by the pitch period for
voiced sounds, or white noise for fricated (or unvoiced) sounds.
The major deficiencies of this model are two-fold: (1) Some speech
sounds, e.g., voiced fricatives such as [z], are produced using
both  vocal  cord  vibrations  and  turbulent  noise  as  excitation  for 
the  vocal  tract;  (2)  Errors  in  the  binary  voiced/unvoiced  (V/UV) 
decision  are  readily  perceived  by  listeners  as  a  severe  degradation 
to  the  quality  of  the  synthesized  speech. 

The  new  source  model  that  we  have  added  to  the  real-time 
vocoder  has  both  the  voiced  (pulse)  source  and  noise  source;  it 
allows  for  selective  excitation  of  different  speech  frequency  bands 
by  different  sources.  The  spectrum  is  divided  into  a  low  frequency 
and  a  high  frequency  region,  with  the  pulse  source  exciting  the  low 
region  and  the  noise  source  exciting  the  high  region.  The  cut-off 
frequency  that  separates  the  two  regions  is  a  parameter  of  the 
model,  which  is  computed  and  transmitted  to  the  receiver.  The 
cut-off frequency is a continuous rather than a binary parameter,
and  it  has  a  "soft-fail"  effect  on  perception  in  that  perception  is 
not  very  sensitive  to  small  changes  in  its  value.  (For  more  on  the 
mixed-source  model,  see  the  paper  in  Appendix  B.) 

2.2.4.2 Details of the Algorithm

Mixed-source  analysis  begins  with  the  pitch  and  voicing 
estimates  obtained  from  the  input  speech  by  the  SIFT  pitch 
extraction  algorithm  in  the  real-time  vocoder.  The  input  data 
originally  used  by  SIFT  is  multiplied  by  a  Hamming  window  centered 
on  the  data,  whose  length  is  determined  by  the  pitch  estimate,  such 
that  the  window  contains  2  to  3.5  pitch  periods.  The  length  of  the 
window  is  important.  If  it  is  too  narrow,  the  harmonics  cannot  be 
resolved;  if  it  is  too  wide,  local  maxima  unrelated  to  the 
harmonics  can  occur. 

Once  the  data  is  suitably  windowed,  an  FFT  is  performed  to 
obtain  a  255-point  magnitude-squared  spectrum.  A  smaller  FFT  would 
yield  resolution  too  coarse  for  resolving  the  harmonics  of 
low-pitched  speech.  The  algorithm  then  finds  all  local  maxima, 
using  3-point  quadratic  interpolation  to  find  the  frequency  of  each 
peak.  The  harmonics  are  then  examined  to  determine  the  frequency 
at  which  the  harmonic  structure  disappears.  The  harmonic  structure 
may  disappear  over  a  portion  of  the  spectrum  and  then  reappear  at  a 
higher frequency. The algorithm allows for this possibility and
chooses the cutoff frequency to be the highest frequency at which
harmonic structure is evident. If the frame is unvoiced, the
cutoff frequency is set to zero. The cutoff frequency is quantized
to 3 bits and transmitted only when it changes.
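
The peak-picking step can be illustrated with a short sketch of 3-point quadratic (parabolic) interpolation. The code fits a parabola through a local maximum and its two neighbors and returns the fractional bin position of the vertex; the function names are hypothetical, and the conversion from bin position to frequency is omitted because it depends on the FFT size and sampling rate.

```python
def local_maxima(spectrum):
    """Indices of interior points that exceed both neighbors."""
    return [i for i in range(1, len(spectrum) - 1)
            if spectrum[i - 1] < spectrum[i] >= spectrum[i + 1]]

def refine_peak(spectrum, i):
    """Fractional bin position of the parabola vertex around bin i."""
    y0, y1, y2 = spectrum[i - 1], spectrum[i], spectrum[i + 1]
    denom = y0 - 2.0 * y1 + y2
    if denom == 0.0:
        return float(i)          # flat top: keep the bin center
    return i + 0.5 * (y0 - y2) / denom

# Made-up spectrum: an exact parabola with its true peak at bin 3.3.
spectrum = [10.0 - (x - 3.3) ** 2 for x in range(8)]
peaks = [refine_peak(spectrum, i) for i in local_maxima(spectrum)]
```

Because the test spectrum is itself a parabola, the interpolation recovers the true peak position; on a real magnitude-squared spectrum the result is only an approximation.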

The  synthesis  portion  of  the  algorithm  applies  a  low-pass 
filter,  with  cutoff  frequency  corresponding  to  the  transmitted 
cutoff, to the pulse excitation, and a high-pass filter, with the
same cutoff, to the noise excitation. The filtered excitations are
added, and then passed through the standard synthesis all-pole
filter. 
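
The mixed excitation can be sketched as follows. The filter design here is an assumption made for illustration (a Hamming-windowed sinc low-pass and its complementary high-pass); the filters actually used in the vocoder are not specified in this section, and the mapping from the 3-bit cutoff code to a frequency is likewise left out. The cutoff is expressed in cycles per sample (0 to 0.5).

```python
import math
import random

def lowpass_taps(cutoff, num_taps=31):
    """Hamming-windowed sinc low-pass; cutoff in cycles/sample (0 to 0.5)."""
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def highpass_taps(cutoff, num_taps=31):
    """Complement of the low-pass: a delayed unit impulse minus the low-pass."""
    taps = [-t for t in lowpass_taps(cutoff, num_taps)]
    taps[(num_taps - 1) // 2] += 1.0
    return taps

def fir(signal, taps):
    """Direct-form FIR filter with zero initial state."""
    return [sum(taps[j] * signal[n - j] for j in range(len(taps)) if n - j >= 0)
            for n in range(len(signal))]

def mixed_excitation(pulses, noise, cutoff):
    """Low-passed pulse source plus high-passed noise source."""
    low = fir(pulses, lowpass_taps(cutoff))
    high = fir(noise, highpass_taps(cutoff))
    return [a + b for a, b in zip(low, high)]

random.seed(0)
pulses = [1.0 if n % 50 == 0 else 0.0 for n in range(200)]   # pitch period 50
noise = [random.gauss(0.0, 0.1) for _ in range(200)]
excitation = mixed_excitation(pulses, noise, cutoff=0.25)
```

With a cutoff of zero the low-pass output vanishes and the excitation is pure (delayed) noise, matching the unvoiced case; the resulting excitation would then drive the all-pole synthesis filter.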


2.2.4.3 Results

The  mixed-source  analysis  and  synthesis  algorithm  requires  an 
additional  1057  (octal)  words  of  program  source  memory  over  the 
1430  words  required  by  the  standard  LPC  algorithm.  It  also 
requires  6.2  msec  of  execution  time. 

In  the  AP-120B  implementation,  the  algorithm  behaved  much  as 
expected.  During  sonorant  speech,  the  cutoff  value  was  usually  7, 
with  some  occurrences  of  6,  implying  fully  voiced.  During  voiced 
fricatives,  the  cutoff  value  dropped  to  between  2  and  4.  However, 
informal  listening  tests  showed  that  there  was  little  perceptual 
difference  between  the  speech  generated  by  the  standard  vocoder  and 
the vocoder incorporating the mixed-source model modifications. We
believe  that  the  expense  (in  terms  of  execution  time  and  program 
source  memory)  incurred  by  the  new  algorithm  is  not  justified  by 
the  perceptual  improvement  obtained. 

Our  earlier  work  with  the  mixed-source  model  was  done  in  the 
context  of  a  vocoder  with  a  5  kHz  bandwidth.  The  real-time  vocoder 
has  a  3.3  kHz  bandwidth.  We  believe  that  the  mixed-source  model  is 
most  effective  in  improving  quality  in  the  frequency  region  that  is 
missing  in  the  real-time  vocoder.  If  the  bandwidth  used  by  the 
real-time  vocoder  is  ever  increased,  we  believe  that  the 
mixed-source  model  could  have  a  positive  effect  on  the  resultant 
speech  quality. 


3.  ANALYSIS AND CODING

During  the  past  year  we  investigated  alternative  techniques 
for  estimating  and  coding  the  spectral  parameters  used  in  the  LPC 
vocoder.  Section  3.1  presents  our  work  in  trying  to  improve  the 
estimation  of  LPC  parameters  using  adaptive  lattice  and 
autocorrelation  methods.  Section  3.2  describes  our  efforts  at 
spectral  coding  of  LPC  parameters. 

3.1  Adaptive  LPC  Analysis 

In  adaptive  analysis,  the  spectral  parameters  are  updated 
every  sample.  Sets  of  these  parameters  are  then  selected  for 
transmission.  Because  of  the  parallel  repetitive  nature  of 
adaptive  analysis  algorithms,  they  are  well  suited  for 
implementation  in  hardware. 

We have investigated two such adaptive algorithms: the
adaptive lattice method described by Makhoul and Viswanathan in [2]
and in Appendix A, and the adaptive autocorrelation method
described by Barnwell [3].

3.1.1  Adaptive  Lattice  Analysis 

Fig. 3.1 shows the basic lattice structure that is used in LPC
analysis. $x(n)$ is the input speech signal, $f_m(n)$ is the "forward"
residual at stage $m$, $g_m(n)$ is the "backward" residual, and $K_m$ is
the reflection coefficient. Let the forward transfer function from
the input to the output be

$$A_p(z) = 1 + \sum_{k=1}^{p} a_k z^{-k} \tag{1}$$

where $p$ is the number of stages in the lattice. Then $1/A_p(z)$ is
the all-pole filter used in the speech synthesis at the receiver.
For the all-pole filter to be stable one can show that the
reflection coefficients must obey

$$|K_m| < 1, \qquad 1 \le m \le p. \tag{2}$$

In a situation such as speech, the signal $x(n)$ is
nonstationary, and therefore the filter coefficients $K_m$ must vary
as a function of time. From Fig. 3.1, the following time-varying
relations hold:

$$f_0(n) = g_0(n) = x(n) \tag{3a}$$

$$f_m(n) = f_{m-1}(n) + K_m(n)\, g_{m-1}(n-1) \tag{3b}$$

$$g_m(n) = K_m(n)\, f_{m-1}(n) + g_{m-1}(n-1) \tag{3c}$$

where we have shown the explicit dependence of $K_m$ on time as $K_m(n)$.
Assuming that we know all the quantities in (3) at time $n$, we need
to compute the value of $K_m(n+1)$ at time $n+1$.

The computation of $K_m$ is based on the minimization of a
mean-square type of error of the form:

$$E_m(n) = \sum_{k=-\infty}^{n} w(n-k)\, e_m^2(k) \tag{4}$$

where $e_m^2(k)$ is a function of the forward and backward residual
energies given by

$$e_m^2(k) = (1-\gamma)\, f_m^2(k) + \gamma\, g_m^2(k), \qquad 0 \le \gamma \le 1 \tag{5}$$

and $w(n)$ is a weighting sequence, or window, that weights the
residual energy into the past. For an adaptive situation, one
designs $w(n)$ such that the more recent values are given greater
importance. The constant $\gamma$ determines the mix between the forward
and backward residuals. The value of $K_m$ is obtained by minimizing
(4) with respect to $K_m$. The result is

$$K_m(n+1) = -\frac{\sum_{k=-\infty}^{n} w(n-k)\, f_{m-1}(k)\, g_{m-1}(k-1)}
                   {\sum_{k=-\infty}^{n} w(n-k)\left[\gamma f_{m-1}^2(k) + (1-\gamma)\, g_{m-1}^2(k-1)\right]} \tag{6a}$$

$$K_m(n+1) = -\frac{C_m(n)}{D_m(n)}. \tag{6b}$$

Theoretically $K_m(n+1)$ in (6) is guaranteed to obey (2) only for
$\gamma = 0.5$. However, in practice, a wider range of $\gamma$ can be used
without violating (2).
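
The step from (4) and (5) to (6) can be written out as follows. This is a sketch of the reasoning and not text from the original report. It assumes that $K_m$ is treated as constant over the window while differentiating and that $w(n) \ge 0$.

```latex
\begin{aligned}
\frac{\partial E_m(n)}{\partial K_m}
  &= 2\sum_{k=-\infty}^{n} w(n-k)\Big[(1-\gamma)\, f_m(k)\, g_{m-1}(k-1)
       + \gamma\, g_m(k)\, f_{m-1}(k)\Big] = 0 \\[4pt]
\text{using (3b), (3c):}\quad
0 &= \sum_{k=-\infty}^{n} w(n-k)\Big[f_{m-1}(k)\, g_{m-1}(k-1)
       + K_m\big(\gamma f_{m-1}^2(k) + (1-\gamma)\, g_{m-1}^2(k-1)\big)\Big] \\[4pt]
K_m &= -\frac{\sum_{k} w(n-k)\, f_{m-1}(k)\, g_{m-1}(k-1)}
             {\sum_{k} w(n-k)\big[\gamma f_{m-1}^2(k) + (1-\gamma)\, g_{m-1}^2(k-1)\big]} \\[4pt]
\gamma = \tfrac12:\quad
|K_m| &\le 1 \quad\text{because}\quad
  \tfrac12\big(f_{m-1}^2(k) + g_{m-1}^2(k-1)\big) \ge \big|f_{m-1}(k)\, g_{m-1}(k-1)\big| .
\end{aligned}
```

The last line shows why $\gamma = 0.5$ bounds the magnitude of the reflection coefficient term by term. For other values of $\gamma$ the same inequality does not hold in general.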

Windowing

The window $w(n)$ must have the property

$$w(n) = 0, \qquad n < 0 \tag{7}$$

so that only past values of the forward and backward residuals are
used in computing (6). Therefore, from (6), we see that $w(n)$ can
be considered as the impulse response of a causal filter. If the
filter is recursive and of finite order, the numerator and
denominator can be computed recursively.

In our experiments we used real-pole filters of the form

$$W(z) = \frac{1}{(1 - \alpha z^{-1})^{N}} \tag{8}$$

where we have used the z-transform notation. These are
multiple-pole filters determined by two parameters: $N$, the order
of the filter, and $\alpha$, the pole location. As an example, assume
$N=1$; then the numerator and denominator in (6b) can be computed
recursively from

$$C_m(n) = \alpha\, C_m(n-1) + f_{m-1}(n)\, g_{m-1}(n-1) \tag{9a}$$

$$D_m(n) = \alpha\, D_m(n-1) + \left[\gamma f_{m-1}^2(n) + (1-\gamma)\, g_{m-1}^2(n-1)\right] \tag{9b}$$

Procedure

The procedure for computing $K_m$ adaptively is as follows. At
time $n$, we assume that the following quantities are held in
computer memory:

$$\begin{aligned}
&K_m(n), && 1 \le m \le p \\
&g_m(n-1), && 1 \le m \le p \\
&C_m(n-i),\ D_m(n-i), && 1 \le m \le p,\ 1 \le i \le N \\
&x(n)
\end{aligned} \tag{10}$$

where $p$ is the order of the LPC filter in (1) and $N$ is the order of
the window filter in (8). The procedure then is:

1.  $m \leftarrow 1$

2.  Compute the residual values $f_m(n)$ and $g_m(n)$ using (3) and
    (10).

3.  Compute $C_m(n)$ and $D_m(n)$ recursively. (Use (9) for $N=1$, for
    example.)

4.  Compute $K_m(n+1)$ from (6b).

5.  $m \leftarrow m+1$

6.  If $m > p$, exit; otherwise, go to step 2.

The experimental results will be given in Section 3.1.4.
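
The six steps above can be sketched in a few lines for the 1-pole window (N=1). This is an illustration under stated assumptions, not the implementation used in our experiments: the function name, the zero initial conditions, and the small guard `eps` against division by zero are all hypothetical choices.

```python
import math


def adaptive_lattice(x, p, alpha, gamma=0.5, eps=1e-12):
    """Sample-by-sample adaptive lattice with a 1-pole window.
    Returns, for each sample n, the list [K_1(n+1), ..., K_p(n+1)]."""
    K = [0.0] * (p + 1)        # K[m] holds K_m(n); index 0 is unused
    C = [0.0] * (p + 1)        # C_m(n-1)
    D = [0.0] * (p + 1)        # D_m(n-1)
    g_prev = [0.0] * (p + 1)   # g_m(n-1) for m = 0..p
    track = []
    for sample in x:
        f = [0.0] * (p + 1)
        g = [0.0] * (p + 1)
        f[0] = g[0] = sample                                   # (3a)
        for m in range(1, p + 1):                              # steps 1, 5, 6
            f[m] = f[m - 1] + K[m] * g_prev[m - 1]             # (3b), step 2
            g[m] = K[m] * f[m - 1] + g_prev[m - 1]             # (3c), step 2
            C[m] = alpha * C[m] + f[m - 1] * g_prev[m - 1]     # (9a), step 3
            D[m] = alpha * D[m] + (gamma * f[m - 1] ** 2
                                   + (1.0 - gamma) * g_prev[m - 1] ** 2)  # (9b)
            K[m] = -C[m] / D[m] if D[m] > eps else 0.0         # (6b), step 4
        g_prev = g
        track.append(K[1:])
    return track


# A sinusoid of 1 radian/sample: K_1 should settle near -cos(1).
x = [math.cos(1.0 * n) for n in range(2000)]
track = adaptive_lattice(x, p=2, alpha=0.99)
```

With `gamma=0.5` every coefficient in `track` stays within the bound (2). Selecting which of these per-sample sets to transmit is a separate step.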

3.1.2  Adaptive  Autocorrelation  Analysis 

In this method [3], the short-term autocorrelation
coefficients of the signal $x(n)$ are computed recursively from:

$$R_m(n) = \sum_{k=-\infty}^{n} w_m(n-k)\, x(k)\, x(k-m), \qquad 0 \le m \le p \tag{11}$$

where $R_m(n)$ is the $m$th autocorrelation coefficient at time $n$ and
$w_m(n)$ is a window that weights the lagged products $x(n)x(n-m)$ into
the past. The autocorrelations can then be used to compute the LPC
coefficients at any time $n$ using the normal equations.

Note the difference between the windows here and those used for the
lattice in (8). The window for the lattice is the same for all
lattice stages. Here, the window is different for different
autocorrelation coefficients; that is why $w_m(n)$ is subscripted to
indicate its dependence on the autocorrelation index.
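
The two stages of this method, the recursive autocorrelation update and the solution of the normal equations, can be sketched as follows. The sketch makes a simplifying assumption that departs from the method of [3]: it uses the same one-pole exponential window for every lag, not the lag-dependent windows $w_m(n)$. The function names are hypothetical. The normal equations are solved with the standard Levinson-Durbin recursion in the sign convention of (1).

```python
def update_autocorr(R, x_hist, alpha):
    """One-sample update R_m(n) = alpha * R_m(n-1) + x(n) x(n-m), m = 0..p.
    x_hist[0] is x(n) and x_hist[m] is x(n-m). The single window for all
    lags is an assumption of this sketch."""
    return [alpha * R[m] + x_hist[0] * x_hist[m] for m in range(len(R))]


def levinson(R):
    """Solve the normal equations for A(z) = 1 + sum_k a_k z^-k.
    Returns (a, K): a[0] = 1, K = reflection coefficients K_1..K_p."""
    p = len(R) - 1
    a = [1.0] + [0.0] * p
    K = []
    err = R[0]
    for m in range(1, p + 1):
        acc = R[m] + sum(a[k] * R[m - k] for k in range(1, m))
        k_m = -acc / err
        K.append(k_m)
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + k_m * a[m - k]
        new_a[m] = k_m
        a = new_a
        err *= (1.0 - k_m * k_m)
    return a, K


# Autocorrelation of a first-order process: only a_1 should be non-zero.
a, K = levinson([1.0, 0.5, 0.25])
R_next = update_autocorr([0.0, 0.0], [2.0, 3.0], alpha=0.9)
```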

3.1.3  Selection  of  Parameters  for  Transmission 

Because  a  new  set  of  parameters  is  computed  for  each  input 
sample,  some  method  must  be  employed  to  select  the  sets  to  be 
transmitted  in  order  to  minimize  the  data  rate.  We  have 
investigated three such methods.

The first and simplest method is to send every $M$th set, where
$M$ corresponds to the number of samples in the interframe
transmission interval. This is equivalent to "sampling" the
parameters once each frame.

The second method involves low-pass filtering and downsampling
the discrete-time signals composed of the values of individual
parameters. Since this method showed no improvement over the
sampling method and was much more expensive computationally, we
postponed more detailed investigation of it.

The third method attempted to find the "best" set of
parameters in a frame. Since the values of the parameters vary
with the pitch pulse, it is desirable to use the set of parameters
corresponding to a constant time interval relative to the pitch
pulse. We attempted to find these sets of parameters by
identifying the instant of time during the frame when $V_p$, the
normalized error, was minimum. The simple minimum proved to be
inadequate, however. The resultant speech was judged to be
"wobbly," indicating rapid variations in the parameters from frame
to frame.
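
The first and third selection methods can be sketched as below. The function names are hypothetical, and taking the last sample of each frame in the first method is an arbitrary choice made for the example. `param_sets` stands for the per-sample parameter sets and `v_p` for the per-sample normalized error.

```python
def select_every_mth(param_sets, M):
    """Method 1: keep one parameter set per frame of M samples."""
    return [param_sets[i] for i in range(M - 1, len(param_sets), M)]


def select_min_error(param_sets, v_p, M):
    """Method 3: in each frame of M samples, keep the set computed at the
    instant where the normalized error v_p is smallest."""
    chosen = []
    for start in range(0, len(param_sets) - M + 1, M):
        best = min(range(start, start + M), key=lambda i: v_p[i])
        chosen.append(param_sets[best])
    return chosen


# Integers stand in for parameter sets so the selected instants are visible.
sets = list(range(12))
errors = [5, 1, 7, 9, 4, 4, 0, 2, 9, 8, 7, 6]
sampled = select_every_mth(sets, M=4)
best = select_min_error(sets, errors, M=4)
```

In this toy input the minimum-error instants fall at different positions within successive frames. That kind of frame-to-frame variation is consistent with the "wobbly" quality reported for the simple minimum.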

3.1.4  Experimental  Results 

In both the lattice and autocorrelation methods, we
"transmitted" nine reflection coefficients once every 9.6 ms, with
no quantization.

There is one parameter, $\gamma$, in the lattice method that does not
exist in the autocorrelation method. We tried three values of
$\gamma$: 0, 0.5, and 1. $\gamma = 0$ corresponds to minimizing the forward
residual energy; $\gamma = 1$ corresponds to minimizing the backward
residual energy; and $\gamma = 0.5$ corresponds to minimizing the average
of the two. In our experiments we found that $\gamma = 1$ gave the best
quality, though the differences between the three cases were small.
The results below apply to the case where $\gamma = 1$.

The only variables that remain are the window parameters: $\alpha$
and the window order. Two window orders were used: 1-pole and
3-pole windows ($N=1$ and 3 in (8) for the lattice method, and the
windows (13) and (14) for the autocorrelation method). For each of
the two window orders the value of $\alpha$ was varied to optimize the
speech quality.

As the value of $\alpha$ was increased, the speech quality changed
from being crisp but rough and wobbly, to smooth but reverberant
and muffled. For each case, there was a value of $\alpha$ for which these
effects were minimal and the speech quality was judged to be the
highest. The adaptive lattice and adaptive autocorrelation methods
gave the same speech quality for a given window order and a
given $\alpha$. Fig. 3.2 shows the optimal ranges of $\alpha$ for which the
speech quality was judged to be the best. Also shown are the
preferred values of $\alpha$. For the 1-pole cases, the optimal range for
$\alpha$ was 0.992-0.994, with 0.993 being preferred. For the 3-pole
cases, the optimal range for $\alpha$ was 0.975-0.980, with 0.978 being
preferred.


    Window    Optimal $\alpha$ Range    Preferred $\alpha$ Value

    1-pole    0.992-0.994             0.993
    3-pole    0.975-0.980             0.978

Fig. 3.2  Optimal ranges for $\alpha$ for the 1-pole and the
          3-pole windows for both the lattice and
          autocorrelation methods.


                              3-pole $\alpha$
                          0.972   0.975   0.978

    1-pole $\alpha$  0.992     19      23      24
                   0.993     20      22      23
                   0.994     22      22      24

    6 sentences, 4 listeners
    maximum score in each bin = 6 x 4 = 24

Fig. 3.3  Number of times 3-pole method preferred over
          1-pole method for three preferred values of
          $\alpha$ for each method.


In the optimal ranges shown in Fig. 3.2, the 3-pole windows
almost always produced higher speech quality than 1-pole windows.
Fig. 3.3 shows the number of times the 3-pole window was preferred
over the 1-pole window, using the lattice method with 6 sentences
from male and female speakers. The scores were obtained from four
listeners. Therefore, we recommend the use of 3-pole windows over
1-pole windows.

When  compared  to  the  standard,  finite-window  LPC,  the  adaptive 
methods  with  3-pole  windows  gave  slightly  better  quality, 
especially  during  sonorant  regions. 

3.2  Spectral  Coding  of  LPC  Parameters 

It  is  well  known  that  channel  vocoders  possess  frequency 
specificity,  in  that  the  channels  are  independent  and  the 
quantization  of  one  channel  does  not  affect  other  channels.  This 
property  is  especially  important  in  the  presence  of  acoustic  noise. 
During  the  past  year,  we  have  attempted  to  develop  a  type  of  coding 
for LPC parameters that we hoped would have similar properties.

We argued in our previous work that the reflection
coefficients  and  the  poles  are  the  only  two  sets  of  parameters 
(from  a  large  set  that  we  investigated)  that  guarantee  filter 
stability  under  quantization.  We  chose  the  reflection  coefficients 
for  transmission  because  they  are  naturally  ordered,  and  the  poles 
are  not.  (Natural  ordering  is  very  important  in  taking  statistics 
and  using  them  to  reduce  the  transmission  rate  by  proper  encoding.) 

However, the poles continue to have one advantage that is not
shared by other parameters, namely that the poles possess frequency
specificity. This is important for two reasons: 1) the ear is
especially sensitive to formant positions, which have largely
one-to-one correspondence with pole positions, and 2) the
sensitivity of the ear is frequency selective; namely, it is more
sensitive to low frequencies than to high frequencies. Therefore,
had it not been for the natural ordering problem, the poles would
have been prime candidates for transmission.

It is important to note that when any reflection coefficient
is quantized, the result is a perturbation of the spectrum at all
frequencies, with no control over any particular region. On the
other hand, if a pole is quantized, the effect is largely seen in
the region of the formant corresponding to that pole; the positions
of other formants are not affected at all. We believe that this
frequency specificity is important in maintaining high speech
quality.

We implemented a new spectral coding scheme which transforms
the set of $p$ poles inside the unit circle to a set of $p$ poles on
the unit circle. In this manner, the new poles become
automatically ordered by frequency (i.e., position on the unit
circle), and hence one can collect the statistics needed for
efficient coding.

The new poles are actually computed from two polynomials that
are easily obtained from the original LPC polynomial [4]. Let the
LPC polynomial of order $p$ be given by (1). The LPC polynomial of
order $p+1$ is given by

$$A_{p+1}(z) = A_p(z) + K_{p+1}\, z^{-(p+1)} A_p(z^{-1}) \tag{15}$$

where $K_{p+1}$ is to be specified. The method consists of setting
$K_{p+1}$ to either $+1$ or $-1$, resulting in two polynomials:

$$A^{+}_{p+1}(z) = A_p(z) + z^{-(p+1)} A_p(z^{-1}) \tag{16}$$

$$A^{-}_{p+1}(z) = A_p(z) - z^{-(p+1)} A_p(z^{-1}). \tag{17}$$

The coefficients of $A^{+}(z)$ and $A^{-}(z)$ are thus obtained from the
coefficients of $A_p(z)$ by simple additions and subtractions. The
two new polynomials are guaranteed to have their poles interleaved
on the unit circle. (The actual implementation involves the root
computation for two transformed polynomials, each having $p/2$ roots
on the real line between -1 and 1, which is much simpler than
computing the $p+1$ complex roots of each of the two polynomials.)
Figure 3.4 shows an example where the crosses on the unit circle
represent the roots of $A^{+}(z)$, the circles represent the roots of
$A^{-}(z)$, and the roots of $A_p(z)$ are inside the unit circle. (The
crosses and circles on the real line are the projections of those
on the unit circle.)

At the synthesis end, $A_p(z)$ is computed from (16) and (17) by
simple addition:

$$A_p(z) = \left[A^{+}_{p+1}(z) + A^{-}_{p+1}(z)\right]/2. \tag{18}$$
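
The coefficient arithmetic behind (16), (17), and (18) can be sketched as follows. The function names and the example polynomial are hypothetical, and the root-finding step is not shown. Coefficient lists are in ascending powers of $z^{-1}$.

```python
def sum_diff_polys(a):
    """a = [1, a_1, ..., a_p] for A_p(z) = 1 + sum_k a_k z^-k.
    Returns the coefficients (length p+2) of A+ and A-,
    i.e. K_{p+1} = +1 and K_{p+1} = -1 in the step-up recursion."""
    ext = list(a) + [0.0]            # A_p(z) padded to degree p+1
    rev = [0.0] + list(a)[::-1]      # z^-(p+1) * A_p(1/z)
    plus = [u + v for u, v in zip(ext, rev)]
    minus = [u - v for u, v in zip(ext, rev)]
    return plus, minus


def recover(plus, minus):
    """Average of the two polynomials; the z^-(p+1) term cancels and is dropped."""
    return [(u + v) / 2.0 for u, v in zip(plus, minus)][:-1]


# Second-order example with both roots at radius 0.8 (inside the unit circle).
a = [1.0, -0.9, 0.64]
plus, minus = sum_diff_polys(a)
a_back = recover(plus, minus)
```

The sum polynomial has symmetric coefficients and the difference polynomial antisymmetric ones. That symmetry is what allows the root search to be reduced to the real line between -1 and 1.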

Unfortunately,  we  found  that  the  spectral  coding  technique  did 
not  have  the  desired  frequency  specificity  property.  A 
quantization  error  in  any  of  the  poles  on  the  unit  circle  affected 
the  locations  of  all  of  the  transformed  poles  inside  the  unit 
circle.  The  general  result  was  that  this  spectral  coding  method 
did  not  result  in  substantially  improved  quality  over  current 
coding techniques.


Fig. 3.4  Pole distribution in the z-plane for a single LPC all-pole
          spectrum with $p=12$. The actual poles are inside the unit
          circle. The result of the proposed pole transformation is
          shown on the unit circle, with the crosses showing the
          roots of $A^{+}(z)$ and the circles the roots of $A^{-}(z)$. The
          projections of those roots on the real line are also
          shown.


4.  SOURCE MODEL FOR RESIDUAL-EXCITED CODERS

In  this  section  we  report  on  our  work  on  the  residual-excited 
baseband  coder.  Most  of  the  effort  has  gone  into  developing  a 
source  model  for  use  as  excitation  to  the  synthesis  filter  of  a 
baseband  coder.  In  the  following  subsections  we  describe  the 
baseband  coder  and  point  out  the  need  for  source  modeling.  We  then 
summarize  our  research  in  that  area. 

4.1  A  General  Residual-Excited  Baseband  Coder 


In  a  baseband  coder,  a  low-frequency  portion  of  the  signal, 
known  as  a  baseband,  is  quantized  and  transmitted.  Figure  4.1 
shows  the  block  diagram  of  a  residual-excited  baseband  coder.  The 
system  is  based  on  a  linear  prediction  representation  of  the  speech 
signal.  In  Fig.  4.1,  the  block  labelled  "baseband  extraction"  has 
two  functions.  The  first  is  to  filter  the  input  speech  signal  by 
means  of  the  LPC  inverse  filter.  The  second  function  is  to  lowpass 
filter  and  downsample  the  residual  waveform  to  retain  its  low 
frequency  components.  As  shown  in  the  figure,  a  major  task  at  the 
receiver  is  to  regenerate  the  missing  high-frequency  components. 
The  output  of  the  high-frequency  regeneration  box  is  an  excitation 
signal  with  a  flat  spectrum.  In  the  absence  of  quantization  noise, 
the  low  frequency  portion  of  the  excitation  signal  is  identical  to 
the baseband residual while the high-frequency portion is the
result of modeling the original residual spectrum. The fullband
excitation  signal  is  used  as  input  to  the  all-pole  LPC  synthesis 
filter  to  generate  the  output  speech. 

The quality of the output speech signal is determined by four
factors: a) width of the baseband, b) coding of the baseband,
c) estimation and coding of spectral parameters, and d) the
high-frequency regeneration (HFR) method employed. In our work to
date we have concentrated mainly on the fourth factor, HFR. The
HFR method will be explained below. As for the coding of the
baseband, an important consideration for the design of the
coder, we have thus far chosen a simple approach: APCM of
the baseband residual together with entropy coding to maintain the
total average bit rate at or below 9.6 kb/s.

4.2  High-Frequency  Regeneration 

In  the  second  quarterly  progress  report  (QPR)  and  in  a 
recently  published  paper,  included  here  as  Appendix  C,  we  have 
presented  new  methods  of  high-frequency  regeneration  for  baseband 
coders.  The  idea  behind  the  new  methods  derives  from  the 
pitch-excited  LPC  vocoder.  In  voiced  excitation,  the  spectrum  of 
the  excitation  is  a  flat  line  spectrum  at  multiples  of  the 
fundamental  pitch  frequency.  Such  a  spectral  structure  is  periodic 
and repetitive: the high-frequency structure is the same as at low
frequencies. The spectrum of unvoiced excitation, on the other
hand, is continuous and has a random spectrum with a flat envelope.
However, the details of the unvoiced spectrum are not as
perceptually important as the details of the voiced spectrum.
Therefore, the unvoiced spectrum can be considered repetitive also,
in that any similar spectrum can be substituted with equally good
results.

The new regeneration method, then, is simply to duplicate the
baseband spectrum at higher frequencies, in some fashion. We
discussed two types of HFR: spectral folding and spectral
translation. Figure 4.2 illustrates the effect of the two types of
HFR for the 3-band case, with a baseband width of B Hz. In
spectral translation, the frequency components between nB and
(n+1)B are simply a highpass translated version of the frequency
components between 0 and B, for all n. Such is also the case for
spectral folding but only for n even. In spectral folding and for
n odd, the frequency region between nB and (n+1)B is the mirror
image of the baseband.

4.2.1  Time-Domain  HFR 

We now describe the time-domain implementation of these
methods. Briefly, integer-band spectral duplication is the case
where the baseband width, B Hz, is adjusted such that the total
signal bandwidth, W Hz, is an integer multiple of B, i.e., W=LB,
with L an integer (see Fig. 4.2). The time-domain implementation
of integer-band HFR by spectral folding is simply done by inserting
L-1 zeros between consecutive time-domain samples of the baseband
residual. Spectral translation is implemented in a similar
fashion, but it requires further filtering. For details on both
methods, the reader is referred to Appendix C and to the second
QPR. Spectral folding and translation are attractive because, in
general, they are computationally less expensive than the
traditional waveform rectification methods. Waveform rectification
has been the most commonly used HFR method in baseband coders.


Fig. 4.2  (a) Baseband spectrum.
          (b) 3-band spectral folding.
          (c) 3-band spectral translation.

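
The zero-insertion form of spectral folding can be checked numerically with a short sketch. The function names and the sample values are made up for the example. The direct DFT is used only to inspect the result and is not part of the HFR method itself.

```python
import cmath


def fold_by_zero_insertion(baseband, L):
    """Insert L-1 zeros after each baseband sample (sampling rate rises by L)."""
    out = []
    for s in baseband:
        out.append(s)
        out.extend([0.0] * (L - 1))
    return out


def dft_mag(x):
    """Magnitude of the DFT, computed directly (adequate for short inputs)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]


base = [0.5, 1.0, -0.3, 0.8, -1.2, 0.1, 0.4, -0.6]   # arbitrary baseband samples
L = 3
up = fold_by_zero_insertion(base, L)
base_mag = dft_mag(base)
up_mag = dft_mag(up)
```

The spectrum of the zero-stuffed signal is the baseband spectrum repeated L times around the unit circle. Since the spectrum of a real signal is symmetric, each odd-numbered band appears as a mirror image of the baseband, which is the folding pattern of Fig. 4.2(b).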

In  preliminary  experiments  using  HFR  by  spectral  folding  or 
spectral  translation  we  heard  a  number  of  distortions  in  the  form 
of  added  tones.  These  tones  were  generally  more  audible  for  a 
smaller  baseband  width  and  for  higher-pitched  voices.  For  the 
spectral  folding  case,  we  were  able  to  verify  the  existence  of  a 
tone  at  even  multiples  of  the  folding  frequency,  i.e.  at  multiples 
of  2B  Hz.  The  tone  was  largely  eliminated  by  a  simple  method:  we 
subtracted  off  the  short-term  d.c.  in  the  baseband  residual, 
because  the  d.c.  is  folded  into  multiples  of  2B  Hz.  Following 
spectral  folding,  we  restored  the  original  d.c.  so  as  not  to 
disturb  the  average  signal  level,  but  the  energy  at  multiples  of  2B 
Hz had already been eliminated.

Other audible tones have been more difficult to trace.
However, HFR by integer-band spectral duplication does not seem to
cause the perceptible roughness that was heard with rectification.

One  reason  for  the  existence  of  these  background  tones  may  be 
the  fact  that,  with  spectral  duplication,  the  harmonic  structure  is 
interrupted  at  multiples  of  B  Hz.  We  hypothesized  that  the  tones 
may  be  eliminated  by  adjusting  the  width  B  of  the  baseband  to  be  a 
multiple  of  the  pitch  fundamental  frequency  on  a  short-term  basis. 
Unfortunately,  if  implemented  in  the  time  domain,  such  a  scheme 
would  require  an  enormous  amount  of  computation  which  would  offset 
the  initial  reduction  in  computation  afforded  by  the  spectral 
duplication method. We have therefore decided to take a
frequency-domain approach that would allow us to implement
pitch-adaptive HFR with a minimal amount of computation. This is
explained next.

4.2.2  Frequency-Domain HFR

In  the  frequency  domain,  the  baseband  frequency  components  can 
be  easily  duplicated  at  higher  frequencies  to  obtain  the  frequency 
components  of  the  full-band  excitation  signal.  As  mentioned 
earlier,  in  our  work  to  date,  we  have  transmitted  the  baseband 
residual  waveform  using  APCM.  Thus,  at  the  receiver,  it  is 
necessary to perform a time-to-frequency transformation prior to
HFR. Once in the frequency domain, care is taken, using an
estimate of the pitch, not to interrupt the harmonic structure of
the excitation signal. In case the coding at the transmitter makes
use  of  pitch,  the  value  of  pitch  is  transmitted  and  is  readily 
available  at  the  receiver.  Otherwise,  the  receiver  can  easily 
extract  a  pitch  value  from  the  received  baseband,  e.g.,  by 
detecting  the  location  of  the  peak  of  the  autocorrelation  of  the 
baseband.  With  pitch  known,  the  receiver  extracts  a  subinterval  of 
the  baseband  containing  an  integer  number  of  harmonics.  The  chosen 
subinterval  is  duplicated  (translated)  at  higher  frequencies  as 
many  times  as  necessary  to  fill  the  missing  frequency  components. 
The  HFR  procedure  is  illustrated  in  Fig.  4.3. 
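
The duplication step can be sketched on a list of spectral components. The sketch is illustrative only. It assumes the pitch is an integer number of bins and places the edges of the subinterval midway between harmonics. The function and parameter names are hypothetical.

```python
def pitch_adaptive_hfr(baseband, total_bins, pitch_bins):
    """baseband: received spectral components for bins 0 .. len(baseband)-1.
    pitch_bins: harmonic spacing in bins (integer, by assumption).
    Returns total_bins components with the upper band regenerated."""
    half = pitch_bins // 2
    n_harm = (len(baseband) - half) // pitch_bins   # complete harmonics available
    if n_harm < 1:
        raise ValueError("baseband too narrow to hold one complete harmonic")
    lo = half                            # edge below the first harmonic
    hi = lo + n_harm * pitch_bins        # edge above the last complete harmonic
    sub = baseband[lo:hi]                # subinterval with n_harm harmonics
    full = list(baseband[:hi])           # components below hi are kept as received
    while len(full) < total_bins:
        full.extend(sub)                 # translate the subinterval upward
    return full[:total_bins]


# Line spectrum with a harmonic every 5 bins, received up to bin 23.
baseband = [1.0 if (k > 0 and k % 5 == 0) else 0.0 for k in range(24)]
full = pitch_adaptive_hfr(baseband, total_bins=64, pitch_bins=5)
```

Because the length of the copied subinterval is a whole multiple of the pitch spacing, the regenerated harmonics continue the baseband series without a break. A fixed-width copy would not have this property.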

For  voiced  sounds,  we  found  that  good  perceptual  results  are 
obtained  when  the  subinterval  extends  from  the  spectral  valley  just 
before  the  first  harmonic  to  the  valley  just  after  the  last 
complete  harmonic  present  in  the  baseband  (see  Fig.  4.3).  The 
upper  frequency  edge  of  the  subinterval  is  C  Hz  and  is 
pitch-dependent.  For  unvoiced  sounds,  the  subinterval  consists  of 
the  whole  baseband,  less  its  two  end  points:  the  d.c.  component  and 
the  component  at  B  Hz.  Following  the  HFR  process,  a 
frequency-to-time transformation yields the full-band time-domain
excitation signal to be applied to the synthesis filter 1/A(z). We
note here that the effective baseband width is C<B. The received
frequency components between C and B are discarded, and those
between C and W are regenerated (see Fig. 4.3).

One possible extension of the above described HFR method is to
let the transmitter locate and extract the subinterval of the
baseband to be used in the HFR process. In such a case the
transmitted baseband would be pitch-adaptive and of width C Hz.

A second, and perceptually more important, extension of the
method provides for a better placement of the frequency-translated
intervals. In the above described method of HFR, we assumed that
the spectrum of voiced sounds is periodic in frequency. However,
we know that speech spectra are not exactly harmonic in structure.
To take into account the irregularities of the speech spectrum, we
shift the high-frequency interval around its nominal position in
such a manner as to match, as closely as possible, the corresponding
original frequency components of the full-band residual. This task
is done at the transmitter, where the full-band excitation spectrum
is available. First, the chosen subinterval is translated to its
nominal high-frequency position. Then, it is cross-correlated with
the corresponding frequency components of the full-band residual.
The "optimal" location is then chosen to be at the positive maximum
value of the cross-correlation. When short correlation lags are
considered, i.e., between 0 and 3, the additional cost is only 2
bits for each frequency-translated interval. The transmitted
information indicates to the receiver where to place the baseband
subinterval in the high-frequency region, relative to its nominal
position.
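
The transmitter-side search can be sketched as follows. The sketch assumes four candidate lags, 0 through 3, so that the result fits in 2 bits. The function name and the test data are hypothetical.

```python
def best_shift(sub, full_residual, nominal_start, max_lag=3):
    """Return the lag in 0..max_lag at which the cross-correlation between
    the translated subinterval and the full-band residual spectrum is
    largest. The lag range is an assumption of this sketch."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(max_lag + 1):
        start = nominal_start + lag
        seg = full_residual[start:start + len(sub)]
        if len(seg) < len(sub):
            break                        # candidate would run past the band edge
        val = sum(u * v for u, v in zip(sub, seg))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


# The high band contains a copy of the subinterval displaced by 2 bins
# from its nominal position at bin 20.
sub = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0]
full_residual = [0.0] * 40
full_residual[23] = 1.0
full_residual[26] = 2.0
lag = best_shift(sub, full_residual, nominal_start=20)
```

The receiver needs only the chosen lag. It places the subinterval at the nominal position plus that lag.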

4.3  Results

We have implemented a preliminary version of the proposed
baseband coder and simulated its operation at a transmission rate
of 9.6 kb/s. For inputs at a sampling rate of 6.67 kHz, we have
chosen the frame size to be 19.2 ms, i.e., 128 samples. In our
experiments thus far, we transmitted the baseband residual using
time-domain APCM with entropy coding. The system parameters (LPC
and gain) are transmitted separately at the rate of 2.3 kb/s,
leaving about 7.3 kb/s for the baseband residual.

In our initial experiments, we implemented frequency-domain
integer-band spectral folding and spectral translation. At the
receiver, we tried both the discrete Fourier transform (DFT) and
the discrete cosine transform (DCT). We found the frequency-domain
results to be perceptually similar to the time-domain results, with
low-level tones and no roughness in the background. In more recent
experiments, we performed pitch-adaptive HFR with the added
cross-correlation feature for better spectral duplication (using
only the DCT). Upon informal listening, we found that this system
provides a marked improvement in speech quality over the
traditional waveform rectification approach and over the
non-pitch-adaptive time-domain spectral duplication methods.
However, the method introduces some roughness reminiscent of
waveform rectification.

We expect the quality of the synthetic speech to improve
further when we change the coding of the baseband from APCM to
adaptive transform coding (ATC). The idea here is to perform the
requisite time-to-frequency transformation at the transmitter prior
to coding. Our expectations of improvement in quality are based on
the fact that ATC provides an increase in SNR over APCM and lends
itself more readily to spectral noise shaping, which is essential to
minimize the perception of quantization noise. We have chosen the
DCT because it fits in with existing ATC schemes. In fact, we
have begun the implementation of ATC of the baseband residual and
shall discuss it in detail in the next quarterly progress report,
since work is still ongoing at this point.

In  our  future  work,  we  shall  look  into  the  possibility  of 
having  a  multi  rate  coding  system.  The  system  bit  rate  can  be 
vatied  by  varying  the  width  of  the  baseband.  Alternatively,  the 
transmitter  may  be  operating  at  a  fixed  rate,  but  the  transmission 
channel  will  be  allowed  to  discard  some  of  the  bits.  The  bits 
(codes)  are  usually  arranged  to  correspond  to  the  transmitted 
frequency  components,  in  an  ascending  order,  going  from  low  to  high 
frequencies.  Thus,  the  discarded  codes  correspond  to 
high-frequency  components.  The  receiver  will  regenerate  the 
missing  high-frequency  components,  using  the  HFR  methods  described 
above,  irrespective  of  the  actual  channel  rate. 

5.  PHONETIC SYNTHESIS

5.1  Introduction

Our phonetic synthesis development this year is an initial
step in the development of a very low rate (approx. 100
bits/second) speech transmission system [5]. An overview block
diagram of the very low rate (VLR) transmission system is shown in
Figure 5.1. In order to achieve such a low bit rate, the VLR
vocoder models the speech in terms of phoneme-sized units. The
analyzer in such a vocoder would extract from a spoken sentence a
sequence of triplets. Each triplet consists of a phoneme, a
phoneme duration, and a single pitch value. In our synthesis work,
this sequence of triplets is determined from a "target" sentence by
a human transcriber. The output of the phonetic synthesis program
is the complete set of "synthesis parameters" required by an LPC
synthesizer. They are, for each 10 ms frame, 14 LPC parameters
specifying a spectrum, a value of gain, a voicing flag, and (if
voiced) a value of pitch and cutoff frequency for the mixed-source
model fR,7]. Our goal this year has been to synthesize very
natural sounding speech using only the VLR phonetic input. In
addition to being part of a phonetic vocoder, this phonetic
synthesis program will also be useful in a text-to-speech system,
or as part of a speech storage and playback system requiring very
little storage.

5.2  Design choices

5.2.1  Diphone  templates 

The  basic  method  for  phonetic  synthesis  that  we  have  chosen  is 
concatenation  of  diphone  templates.  A  diphone  is  defined  as  the 
region  from  the  middle  of  one  phoneme  to  the  middle  of  the  next 
phoneme. Thus, for each possible pair of phonemes there is one
diphone. A diphone template consists of the parameters necessary
to synthesize that diphone.
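
As a minimal sketch of this definition, the following Python fragment turns a phoneme string into its diphone sequence. The function name and the use of "-" to mark silence at the utterance edges are illustrative assumptions, not details of the synthesis program itself.

```python
def phonemes_to_diphones(phonemes, silence="-"):
    """Return the diphone sequence for a phoneme sequence.

    Each diphone runs from the middle of one phoneme to the middle of
    the next, so N phonemes padded with silence give N + 1 diphones.
    The silence symbol "-" is an assumption made for this sketch.
    """
    padded = [silence] + list(phonemes) + [silence]
    return list(zip(padded, padded[1:]))


if __name__ == "__main__":
    # "The m..." as the phoneme string DH AX M
    print(phonemes_to_diphones(["DH", "AX", "M"]))
```

Because every phoneme contributes its second half to one diphone and its first half to the next, the diphone count is always one more than the phoneme count when both edges are padded.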

The  diphone  is  a  natural  unit  for  synthesis  because  the 
coarticulatory influence of one phoneme does not usually extend
much  further  than  half  way  into  the  next  phoneme.  Since  diphone 
junctures  are  usually  at  articulatory  steady  states,  minimal 
smoothing  is  required  between  adjacent  diphones.  Also,  since  the 
regions  around  the  phoneme  transitions  are  preserved  intact,  the 
difficult  task  of  duplicating  these  transitions  by  complicated 
acoustic-phonetic  rules  is  avoided.  We  estimate  that  approximately 
2500 diphone templates are needed to achieve high quality.

5.2.2  LPC  synthesis 

We  have  chosen  to  use  LPC  synthesis  because  of  our  extensive 
experience with LPC analysis/synthesis systems. Consequently, the
diphone templates contain the parameters necessary for an LPC vocoder.
For every 10 ms frame in a diphone template, we store a set of 14
Log Area Ratio (LAR) parameters and a value of energy. (LAR
parameters are stored instead of LPC parameters because they behave
better under quantization and interpolation.)
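
The following Python sketch illustrates why interpolation is safer on LARs than on reflection coefficients. It assumes the common convention LAR = log((1 + k) / (1 - k)); the report does not state which sign convention it uses, and the function names are hypothetical.

```python
import math


def reflection_to_lar(k):
    """Log area ratio of a reflection coefficient k with |k| < 1.

    Assumed convention: g = log((1 + k) / (1 - k)).
    """
    return math.log((1.0 + k) / (1.0 - k))


def lar_to_reflection(g):
    """Inverse of reflection_to_lar; lies in (-1, 1) up to rounding."""
    return math.tanh(g / 2.0)


def interpolate_lar(g_a, g_b, weight):
    """Linear interpolation between two LAR frames (weight in [0, 1])."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(g_a, g_b)]


if __name__ == "__main__":
    frame_a = [reflection_to_lar(k) for k in (0.95, -0.30)]
    frame_b = [reflection_to_lar(k) for k in (0.99, 0.20)]
    halfway = interpolate_lar(frame_a, frame_b, 0.5)
    print([lar_to_reflection(g) for g in halfway])
```

Any finite LAR maps back to a reflection coefficient of magnitude below one, so an interpolated frame still describes a stable synthesis filter.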

5.2.3  Real  Speech 

A  third  design  choice  in  this  project  has  been  to  use  diphone 
templates that have been extracted from real speech. We expect
that this will help ensure that the resulting pronunciations
sound natural. The data base from which diphone
templates  are  extracted  is  described  in  detail  in  Section  5.5. 

5.3  Overview

A  major  portion  of  this  project  consists  of  gathering  the  set 
of  diphone  templates  to  be  used  in  synthesis.  The  diphone 
templates  are  being  extracted  from  a  carefully  designed  data  base 
of  short  utterances  spoken  by  a  single  speaker  in  a  quiet 
environment.  After  each  utterance  has  been  digitized,  the 
templates are specified by a researcher who indicates the
appropriate phoneme boundary time markers associated with each
short utterance.

Figure 5.2 illustrates the synthesis procedure.

[Fig. 5.2  Synthesis]

The speech synthesis program

1)  translates the input phoneme sequence into a diphone sequence;

2)  selects the most appropriate diphone template (depending on the
local phonetic context);

3)  time-warps each of the diphone templates to produce a gain
track and 14 LAR parameter tracks of the specified durations;

4)  smooths between adjacent warped diphone templates to minimize
gain and spectral discontinuities;

5)  reconstructs continuous pitch tracks by linear interpolation of
the single pitch values given;

6)  determines the cutoff frequency and voicing using knowledge of
the phoneme being synthesized;

7)  converts the resulting LAR parameter tracks to LPC parameter
tracks;

8)  uses the resulting sequence of LPC, pitch, gain, and cutoff
frequency (specified every 10 ms) as input to control an LPC
speech synthesizer.
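
The LAR-to-LPC conversion in the step list can be sketched as follows. This is a hedged illustration and not the report's code: it assumes LAR = log((1 + k) / (1 - k)), an inverse filter of the form A(z) = 1 + a1 z^-1 + ... + ap z^-p, and the standard step-up recursion. Sign conventions vary between systems, and the function name is hypothetical.

```python
import math


def lar_to_predictor(lars):
    """Convert one frame of LARs to predictor coefficients [1, a1, ..., ap].

    Assumptions of this sketch: k = tanh(LAR / 2), and the step-up
    recursion a_m(i) = a_{m-1}(i) + k_m * a_{m-1}(m - i).
    """
    a = [1.0]
    for g in lars:
        k = math.tanh(g / 2.0)        # LAR -> reflection coefficient
        prev = a + [0.0]              # previous polynomial, padded to new order
        order = len(prev) - 1
        a = [prev[i] + k * prev[order - i] for i in range(order + 1)]
    return a


if __name__ == "__main__":
    lars = [math.log((1 + k) / (1 - k)) for k in (0.5, 0.2)]
    print(lar_to_predictor(lars))
```

In the report's setting this conversion would be applied to each 10 ms frame of 14 LARs after time warping and smoothing.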

5.4  Algorithm Details

5.4.1  Extensions to Diphone Definition

In this section, we describe some extensions to the
fundamental definition of a diphone which permit the highest
possible intelligibility and naturalness to be attained. These
extensions allow "special cases" to be handled uniformly by the
synthesis program.

5.4.1.1  Context-Specific Diphones

Most diphones can be used adequately independent of context,
but there are important exceptions. For instance, in the sequence
[W-IH-L], as in the word "will", the part of the phoneme [IH]
contained in the diphone [W-IH] is drastically affected by the
presence of the [L]. Consequently, we store a separate diphone
template for [W-IH] to be used in this context. To account for
such phenomena we have allowed more than one template to be defined
for diphones when they are affected by context. Additional diphone
templates necessitated by lateralization, retroflexion and other
strong contextual phenomena are expected to account for [illegible] of
the total diphone inventory. The context-dependent diphone
templates are determinable from a normal phoneme string.
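
One way to picture context-dependent selection is a lookup that prefers a context-specific template and falls back to the default one. The sketch below is an assumption-laden illustration: the inventory layout, the keying on the following phoneme only, and the template labels are all hypothetical.

```python
def select_template(inventory, left, right, next_phoneme=None):
    """Pick a template for diphone (left, right).

    inventory maps (left, right, context) to a template, where context
    is the following phoneme or None for the context-free default.
    This keying scheme is an assumption of the sketch.
    """
    specific = (left, right, next_phoneme)
    if specific in inventory:
        return inventory[specific]
    return inventory[(left, right, None)]


if __name__ == "__main__":
    inventory = {
        ("W", "IH", None): "W-IH default template",
        ("W", "IH", "L"): "W-IH template for use before L",
    }
    print(select_template(inventory, "W", "IH", "L"))   # as in "will"
    print(select_template(inventory, "W", "IH", "T"))   # as in "wit"
```

Since the context is read directly off the neighboring phonemes, no extra information beyond the ordinary phoneme string is needed to make the choice.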

5.4.1.2  Splitting Phonemes

Some phonemes that have more than one acoustically distinct
region have been split up into two "pseudophonemes". This permits
us to control the durations of each of the regions independently.
For instance, the diphthong [AY], in "bite", starts out much
like an [AA], as in "pot", but ends up more like the [IY] in
"beet". The two relatively steady regions are connected by a rapid
transition between them. Since durations of the two steady regions
are somewhat independent, as are the contextual effects of
neighboring phonemes, this diphthong has been split into two
pseudophonemes, [AY1-AY2], which appear only in sequence. The
unvoiced plosives and affricates also have two acoustically
distinct regions. Each region is treated as if it were a separate
phoneme.
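
The splitting step can be sketched as a table-driven expansion applied before the diphone sequence is formed. Only the [AY] entry comes from the text; the table name and function are hypothetical, and entries for the unvoiced plosives and affricates would have to be added under whatever labels the real system uses.

```python
# Only the AY split is taken from the text; other entries are left out
# because their pseudophoneme labels are not given.
PSEUDOPHONEMES = {"AY": ["AY1", "AY2"]}


def split_phonemes(phonemes, table=PSEUDOPHONEMES):
    """Replace each splittable phoneme by its pseudophoneme sequence."""
    expanded = []
    for phoneme in phonemes:
        expanded.extend(table.get(phoneme, [phoneme]))
    return expanded


if __name__ == "__main__":
    print(split_phonemes(["B", "AY", "T"]))   # "bite"
```

After expansion, each pseudophoneme can carry its own duration, and the diphone [AY1-AY2] holds the rapid transition between the two steady regions.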

5.4.2  Time Warping

In order to provide input to the LPC synthesis program we must
specify LPC coefficients at fixed intervals (10 ms) by time warping
template  information  (whose  duration  is  fixed)  to  satisfy  phoneme 
durations  specified  by  the  input.  This  is  made  difficult  by  the 
fact  that  the  time  warping  must  preserve  the  naturalness  of  speech. 
One  way  to  do  this  is  to  treat  speech  as  being  made  up  of  elastic 
and  inelastic  regions. 

The principle of distinguishing between "elastic" and
"inelastic"  regions  of  the  template  arises  from  observations  of 
speech  parameters  under  widely  varying  speaking  rates.  Most  of  the 
durational  variation  is  observed  to  occur  during  the  "steady  state" 
portion  of  the  phoneme  (an  elastic  region),  whereas  the  transition 
portions  (inelastic  regions)  are  relatively  insensitive  to  changes 
in  speaking  rate.  Hence,  our  time  warping  algorithm  allows  us  to 
specify  that  a  certain  percentage  of  the  diphone  template  on  each 
side  of  the  phoneme  boundary  is  to  be  treated  as  relatively 
inelastic  and  the  rest  of  the  diphone  template  (the  section 
corresponding  to  phoneme  middles)  as  more  elastic. 

Time  warping  is  accomplished  by  the  use  of  piecewise  linear 
mapping  functions,  which  define  the  part  of  the  template  to  be  used 
at  each  instant  of  time.  This  correspondence  and  the  resulting 
mapping  function  are  illustrated  in  Figure  5.3.  The  speech  being 
synthesized  is  the  sequence  /-  DH  AX  M/  as  in  "The  man...".  The 
vertical  axis  represents  time  in  the  diphone  templates.  The 
horizontal  axis  represents  time  in  the  synthesized  speech.  The 
diphone  template  durations  are  determined  from  the  original 
recorded  short  utterances,  while  the  phoneme  durations  are 
determined  by  the  input  utterance  to  be  vocoded. 

The  phoneme  boundaries  in  the  diphone  templates  are  mapped 
onto  the  phoneme  boundaries  in  the  synthesized  speech  to  define 
uniquely a point in the piecewise linear mapping function. The
diphone boundary is mapped onto the center of the phonemes. The
DASHED lines connect phoneme boundaries, and the DASH-DOT lines
connect diphone boundaries (phoneme centers). Notice that the
templates  are  generally  compressed  during  the  mapping.  The  reason 
for  this  is  that  (whenever  possible)  our  diphone  templates  are 
extracted  from  fully  articulated  short  utterances.  We  deliberately 
designed  our  data  base  to  consist  of  fully  articulated  short 
utterances  because  we  believe  that  compressing  a  long  template  will 
result  in  better  speech  quality  than  expanding  a  short  one.  (We 
were  also  influenced  by  the  consideration  that  information  is  more 
easily  ignored  than  generated.) 

The  elastic  and  inelastic  regions  are  delineated  by  small  tic 
marks  on  both  axes.  The  knees  in  the  mapping  function  shown 
correspond  to  the  intersection  of  these  tic  marks.  Although  both 
elastic  and  inelastic  regions  of  the  template  (in  this  example)  are 
shortened  by  mapping  from  templates  to  phonemes,  the  inelastic 
regions  are  shortened  less. 

As  part  of  our  test  of  the  warping  algorithm,  we  tried  varying 
the  rate  of  the  synthesized  speech.  When  the  time  warping  was 
uniform,  some  phoneme  transitions  (e.g.,  vowel/nasal  transitions) 
became  slurred.  The  time  warping  relations  were  varied  such  that 
all  the  transitions  sounded  natural  over  a  wide  range  of  speech 
rates. 
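
The piecewise linear mapping described in this section can be sketched in Python as below. The report does not give the rule that places the knees on the output axis, so this sketch assumes one: the inelastic part of each half-diphone is scaled by the overall ratio raised to a power `alpha` between 0 and 1, which makes it change less than the elastic part. The names `inelastic_frac` and `alpha` and all numeric values are hypothetical. Times are in milliseconds.

```python
def warp_breakpoints(t0, tb, t1, s0, sb, s1, inelastic_frac=0.5, alpha=0.5):
    """Breakpoints of the map from synthesized time to template time.

    t0, tb, t1: template start, phoneme boundary, template end.
    s0, sb, s1: the same three points in the synthesized speech.
    The alpha rule for the inelastic regions is an assumption.
    """
    def inelastic_lengths(t_len, s_len):
        inel_t = inelastic_frac * t_len
        inel_s = min(inel_t * (s_len / t_len) ** alpha, s_len)
        return inel_t, inel_s

    left_t, left_s = inelastic_lengths(tb - t0, sb - s0)
    right_t, right_s = inelastic_lengths(t1 - tb, s1 - sb)
    out_times = [s0, sb - left_s, sb, sb + right_s, s1]
    tpl_times = [t0, tb - left_t, tb, tb + right_t, t1]
    return out_times, tpl_times


def map_time(s, out_times, tpl_times):
    """Template time to read at synthesized time s."""
    for i in range(len(out_times) - 1):
        lo, hi = out_times[i], out_times[i + 1]
        if hi > lo and lo <= s <= hi:
            frac = (s - lo) / (hi - lo)
            return tpl_times[i] + frac * (tpl_times[i + 1] - tpl_times[i])
    raise ValueError("time outside the synthesized diphone")


if __name__ == "__main__":
    out, tpl = warp_breakpoints(0.0, 100.0, 220.0, 0.0, 60.0, 140.0)
    # One template time per 10 ms output frame.
    print([round(map_time(s, out, tpl), 1) for s in range(0, 141, 10)])
```

With `alpha = 1` the warp becomes uniform, which is the case the rate experiments found to slur some transitions; smaller values keep the regions next to the phoneme boundary closer to their recorded length.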

5.4.3  Parameter Smoothing

When  templates  are  concatenated,  discontinuities  in  the  gain 
and  the  LAR  parameters  may  result  despite  the  fact  that  the 
parameters  are  smooth  within  each  template.  In  order  to  deal  with 
this,  we  smooth  the  parameters  throughout  an  "interpolation  region" 
that  straddles  the  diphone  template  boundary  (phoneme  middle)  and 
contains  potential  parametric  discontinuities.  This  interpolation 
region  is  defined  by  two  "interpolation  points"  (one  from  each 
diphone).

Figure  5.4  compares  two  different  parameter  smoothing 
algorithms  as  applied  to  a  single  parameter.  The  heavy  lines 
indicate  the  parameter  tracks  as  taken  from  the  two  diphone 
templates.  Taken  together  these  tracks  span  one  phoneme.  There  is 
a  discontinuity  at  the  diphone  boundary,  which  is  indicated  by  the 
vertical  DASHED  line.  The  two  vertical  DASH-DOT  lines  delineate 
the  interpolation  region.  The  straight  line  (a)  connecting  the  two 
parameter  tracks  illustrates  the  effect  of  linear  interpolation. 
As  can  be  seen  in  this  case  the  linear  interpolation  results  in  a 
poor fit to the original data. The other curve (b) connecting the
two  points  is  derived  by  adding  a  ramp  (shown  at  the  bottom  of  the 
figure)  to  each  parameter  track,  such  that  the  discontinuity  is 
eliminated.  This  second  smoothing  method  often  preserves  the  shape 
of  the  original  parameter  tracks  better  than  linear  interpolation. 
Note that the second method requires that the parameter tracks be
preserved in the diphone templates. It is expected that [illegible] to
[illegible] frames per diphone template will be sufficient.
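
The two smoothing methods compared above can be sketched for a single parameter as follows. The even split of the discontinuity between the two tracks and the frame-based region sizes `n_left` and `n_right` are assumptions of this sketch; the report only states that a ramp is added to each track so that the discontinuity is eliminated.

```python
def ramp_smooth(left, right, n_left, n_right):
    """Method (b): add a ramp to each track so the jump vanishes.

    left ends at the diphone boundary and right starts there. Half of
    the jump is absorbed by each side (an assumption of this sketch).
    Requires n_left <= len(left) and n_right <= len(right).
    """
    jump = right[0] - left[-1]
    left_out, right_out = list(left), list(right)
    for i in range(n_left):
        weight = (i + 1) / n_left            # rises to 1 at the boundary
        left_out[len(left) - n_left + i] += 0.5 * jump * weight
    for i in range(n_right):
        weight = 1.0 - i / n_right           # falls from 1 at the boundary
        right_out[i] -= 0.5 * jump * weight
    return left_out, right_out


def linear_smooth(left, right, n_left, n_right):
    """Method (a): straight line across the interpolation region."""
    track = list(left) + list(right)
    first = len(left) - n_left
    last = len(left) + n_right - 1
    start, end = track[first], track[last]
    for i in range(first, last + 1):
        track[i] = start + (i - first) / (last - first) * (end - start)
    return track


if __name__ == "__main__":
    left, right = [1.0, 2.0, 4.0, 3.0], [7.0, 6.0, 8.0, 8.0]
    print(ramp_smooth(left, right, 2, 2))
    print(linear_smooth(left, right, 2, 2))
```

The ramp method only shifts the recorded values, so frame-to-frame detail inside the region survives, whereas the straight line discards it. This is why the ramp method needs the original tracks to be kept in the templates.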

5.4.4  Gain  Adjustment 

The  energy  values  to  be  used  in  synthesis  are  stored  in  the 
diphone  templates.  In  some  cases  the  overall  intensity  of  a 
diphone  is  not  consistent  with  that  of  the  neighboring  diphones, 
due  to  differences  in  speaking  level  during  the  recording  of  the 
data  base.  This  inconsistency  is  reduced  by  the  optional 
specification  of  a  gain  adjustment  for  each  diphone  template  in  the 
data base.

If  the  inconsistency  is  due  to  a  different  intonational  stress 
in  the  input  speech,  the  adjustment  to  the  energy  during  a  phoneme 
could  be  specified  by  a  stress  code. 

5.4.5  Excitation 

In  addition  to  14  LPC  parameters  and  gain,  the  LPC  synthesis 
program requires for each 10 ms frame a value of pitch, a voicing
flag,  and  a  cutoff  frequency. 

5.4.5.1  Pitch Track

The pitch track during voiced phonemes is reconstructed from
the input pitch values by straight line interpolation. In our
simulation of the analysis part of a phonetic vocoder, the single
transmitted pitch values are determined from complete pitch tracks
in the sentence being analyzed. The analysis program (given the
phoneme identities and phoneme boundaries) determines a weighted
piecewise linear least-squares fit to that pitch track. The
endpoints of the linear sections (which occur at phoneme
boundaries) are transmitted. The weighting is designed to minimize
the effect of pitch tracker errors. It was found that sentences
synthesized using these piecewise linear pitch tracks are
practically indistinguishable from those using the original
analyzed pitch tracks.
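
The reconstruction side of this scheme, straight-line interpolation between the transmitted pitch values, can be sketched as below. The function name, the millisecond time base, and the example values are assumptions; the weighted least-squares fit used on the analysis side is not shown.

```python
def interpolate_pitch(times_ms, pitch_hz, frame_ms=10):
    """Pitch value for every frame between the first and last boundary.

    times_ms: increasing phoneme-boundary times; pitch_hz: the pitch
    value transmitted for each of those boundaries.
    """
    frames = []
    seg = 0
    t = times_ms[0]
    while t <= times_ms[-1]:
        # Advance to the linear section that contains t.
        while seg < len(times_ms) - 2 and t > times_ms[seg + 1]:
            seg += 1
        t0, t1 = times_ms[seg], times_ms[seg + 1]
        p0, p1 = pitch_hz[seg], pitch_hz[seg + 1]
        frames.append(p0 + (p1 - p0) * (t - t0) / (t1 - t0))
        t += frame_ms
    return frames


if __name__ == "__main__":
    print(interpolate_pitch([0, 50, 100], [100.0, 120.0, 90.0]))
```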

5.4.5.2  Voicing

Voicing  was  determined  directly  from  the  identity  of  the 
phoneme  being  synthesized.  Voicing  errors  are  avoided  by  careful 
placement  of  phoneme  boundaries  in  the  diphone  templates.  We  have 
found  that  a  one  frame  error  in  placement  of  the  boundary  can  cause 
a  severe  "pop"  in  the  synthesized  speech,  due  to  misalignment  of 
spectral  parameters  with  excitation  parameters. 

5.4.5.3  Mixed-Source Model - Cutoff Frequency

Previous work [R,ci has shown that our mixed-source model of
excitation results in more natural sounding (less buzzy) speech by
allowing for specification of a cutoff frequency with every value
of pitch. The voicing excitation is low-pass filtered and the
frication is high-pass filtered. The cutoff frequency
simultaneously marks the upper edge of the voicing spectrum and the
lower edge of the frication spectrum. Currently we have
implemented an algorithm that selects a cutoff frequency based on
distinctive features of the phoneme being synthesized. For
example, for vowels the cutoff frequency is at 5000 Hz (fully
voiced); for unvoiced consonants it is 0 Hz; and for voiced
fricatives (which are produced with both periodic and random
excitation) it is 1500 Hz. The cutoff frequency parameter track is
then low-pass filtered in time in order to minimize excitation
discontinuities at phoneme boundaries. The implementation of the
cutoff-frequency algorithm has resulted in a noticeable improvement
in speech quality; in particular, a decrease in buzziness.
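
A rough sketch of the cutoff-frequency rule follows. The 5000 Hz and 1500 Hz values come from the text. The 0 Hz value for unvoiced consonants, the three class labels, and the moving-average smoother standing in for the unspecified low-pass filter are assumptions of this illustration.

```python
# Cutoff per phoneme class in Hz. The unvoiced value of 0 Hz is an
# assumption; the class names are hypothetical labels.
CUTOFF_HZ = {"vowel": 5000.0, "unvoiced": 0.0, "voiced_fricative": 1500.0}


def cutoff_track(frame_classes, window=3):
    """Per-frame cutoff frequency, smoothed over time.

    A centered moving average stands in for the low-pass filter,
    whose actual design is not given in the text.
    """
    raw = [CUTOFF_HZ[c] for c in frame_classes]
    half = window // 2
    smoothed = []
    for i in range(len(raw)):
        lo, hi = max(0, i - half), min(len(raw), i + half + 1)
        smoothed.append(sum(raw[lo:hi]) / (hi - lo))
    return smoothed


if __name__ == "__main__":
    classes = ["vowel"] * 3 + ["unvoiced"] * 3
    print(cutoff_track(classes))
```

Smoothing turns the abrupt step at the phoneme boundary into a short glide, which is the effect the time-domain filtering is meant to achieve.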

5.5  Data Base

5.5.1  Design  of  the  Data  Base 

At the outset of this project we knew that the synthesis
quality would depend, to a large degree, on the nature of the data
base of speech from which diphone templates were to be extracted.
After several experiments with different data bases, we designed a
data base that seems most appropriate for this project. A
significant problem with earlier data bases was that the gain
parameter was not consistent between diphones that were to be
abutted. This was partially due to the fact that some of the
diphones were taken from the beginning of an utterance, while
others were taken from the middle or end of an utterance. In
addition, different diphones containing the same vowel were
recorded several minutes or even hours apart, and incidental
changes in speaking level and mouth-to-microphone distance produced
noticeable differences in amplitude. However, there was no
consistent method to determine whether one diphone was louder than
another because it was inherently louder, or because the speaker
just happened to be speaking louder.

Since  there  is  evidence  that  gain  was  a  major  problem  with  the 
synthesized  speech,  the  recording  procedure  was  designed  so  that 
incidental differences in loudness could be minimized. The
utterances were reorganized into groups according to the vowel
phoneme of each diphone. Thus, the different diphones involving
one particular vowel were all spoken within a short period (roughly
2 minutes). This close proximity helps to ensure that the speaking
level was roughly constant during different instances of the same
vowel.

The  diphone  utterances  consisted  of  short  nonsense  syllables 
which  were  repeated  three  times  as  in  connected  speech 
(e.g., [pa pa pap]). The diphones are most often extracted from the
middle  syllable,  which  tends  to  have  a  more  prototypical 
articulation  (and  loudness). 

In  addition  to  short-term  variations  in  speaking  level,  there 
is  also  a  long-term  variation  in  the  level  over  several  hours  and 
between  the  several  days  that  elapsed  during  the  recordings.  In 
order  to  estimate  this  effect,  each  group  of  utterances  was 
initiated  and  terminated  with  a  "normalization  utterance".  The 
normalization utterance we chose to use was [dæ dæ dæd]. This
utterance was chosen because the vowel [æ] in combination with a
voiced  plosive  results  in  a  higher  amplitude  than  most  syllables. 
After  the  data  has  been  analyzed,  the  level  of  each  diphone  can  be 
set  relative  to  the  normalization  utterances,  thus  cancelling  out 
long-term  variations  in  speaking  level.  This  framework  also  allows 
for  the  level  adjustment  that  will  probably  be  necessary  for 
several  groups  of  diphones. 
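
The level normalization described here might be sketched as follows. Working in decibels, averaging the two normalization utterances of a group, and the choice of a common reference level are all assumptions of this illustration; the report does not specify how the relative level is computed.

```python
def group_offset_db(norm_start_db, norm_end_db, reference_db):
    """Gain offset that brings a group's normalization level to the reference.

    The group's level is taken as the mean of the normalization
    utterances recorded at its start and end (an assumption).
    """
    return reference_db - 0.5 * (norm_start_db + norm_end_db)


def normalize_group(energies_db, norm_start_db, norm_end_db, reference_db):
    """Apply the group offset to every diphone energy in the group."""
    offset = group_offset_db(norm_start_db, norm_end_db, reference_db)
    return [e + offset for e in energies_db]


if __name__ == "__main__":
    # A group recorded about 3 dB quieter than the chosen reference.
    print(normalize_group([60.0, 55.0], 70.0, 72.0, 74.0))
```

Because every diphone in a group receives the same offset, level differences within the group, which are assumed to be inherent to the diphones, are left untouched.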

Report No. 4159                    Bolt Beranek and Newman Inc.

5.5.2  Contents  of  the  Data  Base 

The  data  base  contains  utterances  for  all  the  diphones  that 
are  felt  to  result  in  different  acoustic  patterns.  The  types  of 
diphones  included  are  shown  below. 

C stands for consonant; V stands for vowel.

DIPHONE            EXAMPLE

CV                 [pa]   as in "pot"
VC                 [ap]   as in "top"
initial cluster    [spr]  as in "spring"
final cluster      [nd]   as in "and"
CC                 [sf]   as in "this formant"
VV                 [iæ]   as in "reality"
In addition to the vowels, we have included
vowel allophones such as retroflexed vowels (vowels followed by
[r]), lateralized vowels (vowels followed by [l]), [ə] (as in
"about"), [ɨ] (as in "multiply"), syllabic nasals, and syllabic [l].
The consonants include silence, flapped [t], unreleased plosives,
affricates, and glottal stops. Due to the inclusion of all
permutations of these phonemes, the new recording includes a total
of 1894 utterances representing 3145 diphones. However, of the
3145 diphones, a few cannot occur in English, and many more are
likely not to be necessary as separate diphones. Therefore, we
expect that approximately 2500 diphones will be necessary to
achieve natural sounding speech.

5.5.3  Recording the Data Base

Since these recordings were to form the basis for the
synthesis to be done in the remainder of this project, the
recordings were monitored carefully. It was felt that, for this
application, very low noise recordings were desirable. The
recordings were made in a quiet room. The microphone used was an
"electret" condenser microphone positioned 2 inches from the right
corner of the mouth at an angle of 45 degrees to the side. The
close-talking microphone was chosen over a more distant microphone
because it allowed us to attenuate low level building-borne noise.
Also, the quality of the microphone was judged to be quite high.

The  recordings  were  made  on  a  Braun  TG-130fl  tape  deck. 

Each  utterance  (including  the  two  normalization  utterances  for 
each  group  of  diphones)  was  digitized  into  a  single  speech  file 
using  12  bits  per  sample.  The  dynamic  range  of  most  of  the 
utterances  only  requires  11  bits. 

5.5.4  Labeling  the  Data  Base 

In order to extract the diphones from these relatively long
(approximately 1 second) diphone utterances, we examine each
utterance and indicate the identity and end points of the relevant
phonemes.  This  transcription  must  be  accurate  since  the  spectra 
and  energy  parameters  derived  from  the  speech  will  be  combined  with 
voicing  information  determined  from  the  phoneme  identity. 
Therefore  misalignment  will  result  in  apparent  "voicing"  errors. 

Since  most  of  the  short  utterances  consist  of  three 
repetitions  of  the  same  syllable,  the  transcriber  must  also  decide 
on  which  occurrence  is  most  typical  for  each  diphone.  Most  of  the 
time,  one  of  the  diphones  in  the  middle  of  the  utterance  is  chosen 
to  avoid  the  effects  peculiar  to  the  ends  of  an  utterance. 

A  diphone  extraction  program  converts  the  transcription  text 
files  into  a  diphone  template  definition  text  file.  The  program 
determines  other  necessary  time  points  within  the  template  by  a 
simple  set  of  rules.  The  diphone  boundary  is  chosen  as  the 
midpoint  between  the  phoneme  boundaries.  The  "interpolation" 
points  and  the  boundary  between  the  elastic  and  inelastic  regions 
of  the  diphone  are  defined  as  a  percentage  of  the  way  between  the 
phoneme  boundary  and  the  diphone  boundary. 
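
The extraction rules in this paragraph can be sketched as a small function that derives all template time points from three hand-labeled phoneme boundaries. The percentage values, the dictionary keys, and the function name are hypothetical; the text states only that the points are placed at a percentage of the way from the phoneme boundary to the diphone boundary.

```python
def template_time_points(p1_start, boundary, p2_end,
                         interp_pct=80.0, inelastic_pct=40.0):
    """Derive template time points from labeled phoneme boundaries.

    p1_start, boundary, p2_end: start of the first phoneme, the
    boundary between the two phonemes, and end of the second phoneme.
    The two percentages are illustrative assumptions.
    """
    start = 0.5 * (p1_start + boundary)   # diphone boundary: middle of phoneme 1
    end = 0.5 * (boundary + p2_end)       # diphone boundary: middle of phoneme 2

    def toward(edge, pct):
        # Point pct percent of the way from the phoneme boundary to edge.
        return boundary + (pct / 100.0) * (edge - boundary)

    return {
        "start": start,
        "end": end,
        "phoneme_boundary": boundary,
        "left_knee": toward(start, inelastic_pct),
        "right_knee": toward(end, inelastic_pct),
        "left_interp": toward(start, interp_pct),
        "right_interp": toward(end, interp_pct),
    }


if __name__ == "__main__":
    print(template_time_points(0.0, 100.0, 300.0))
```

Deriving every point from the labeled boundaries means the transcriber only has to place phoneme boundaries, and changing a percentage re-derives the whole inventory consistently.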

As  the  synthesized  speech  depends  directly  on  the  diphone 
templates  used,  it  is  essential  that  the  different  diphone 
templates  are  chosen  to  be  compatible  with  each  other.  For 
example,  the  diphone  boundaries  should  always  be  at  the  same 
articulatory position so that unnatural articulations are not
introduced. One way of encouraging this is to require that all the
transcribing be done by a single individual. However, since the
labeling process is so tedious, we feel that it may be necessary,
in the interest of time, to have a small number of people, working
closely,  share  the  labeling  effort.  In  either  case,  it  has  become 
apparent,  through  experience,  that  some  immediate  auditory  feedback 
is  helpful. 

5.6  Programs

5.6.1  Display Programs

Since  a  significant  part  of  this  project  consists  of 
accurately  and  consistently  transcribing  a  complete  set  of 
diphones,  we  have  modified  several  display  programs  to  facilitate 
this  effort. 

Our  general  signal  processing  and  display  program  was  modified 
so  that  a  user  can  interactively  edit  the  manual  phonemic 
transcriptions associated with a sentence. The program displays [illegible]
time-varying  parameters  (such  as  energy  and  formants)  along  with 
user-defined  manual  transcriptions. 

Our  real-time  waveform  editing  program  was  modified  so  that, 
in  addition  to  displaying,  editing  and  playing  time  waveforms,  it 
could compute and display the instantaneous spectrum (log-magnitude
power  spectrum  and  LPC  spectral  envelope)  corresponding  to  a  short 
window  of  speech  pointed  to  by  a  cursor  on  the  waveform  display. 
As  the  cursor  is  moved  relative  to  the  waveform,  the  program 
recomputes  and  displays  the  spectra  (corresponding  to  the 
instantaneous  cursor  location)  eight  times  per  second. 

The  program  also  allows  interactive  manual  transcription  of 
time waveforms. The display of the speech waveform and the
short-term spectra provides sufficient information for the labeling of
most phoneme boundaries.

5.6.2  Compiler Programs

Several  programs  were  written  during  the  course  of  the 
project,  which  aided  in  managing  the  diphone  templates  extracted 
from  the  data  base.  Other  programs  were  written  that  performed  a 
"compiler"  function  to  enable  the  synthesis  program  to  access  the 
proper  diphone  template  from  a  large  data  file  quickly  and 
efficiently. 

5.6.3  Synthesis Program

The  synthesis  program  is  written  in  a  modular  fashion  so  as  to 
facilitate  testing  in  several  different  configurations.  To  further 
facilitate  testing,  the  input  sequence  of  triplets  can  be  specified 
by several alternative sources. For instance, the triplets can be
specified directly in a text file or as a sequence of phonemes and
durations typed in (with pitch values computed by rule), or
indirectly  by  automatically  generating  an  ordered  sequence  of  CVC 
and  VCV  syllables,  given  a  particular  vowel. 

As  an  aid  in  evaluating  the  performance  of  the  individual 
synthesis  algorithms,  the  synthesis  program  provides  for  specifying 
the  source  of  any  of  the  (5  types  of)  synthesis  parameters  as 
natural  (directly  from  the  target  sentence)  or  synthetic  (from  the 
synthesis algorithms). Of course, if all the synthesis parameters
are taken directly from the target sentence, the program becomes an
LPC vocoder.

5.7  Experiments 

During  the  course  of  the  project  there  were  several  parts  of 
the synthesis program that needed to be tested. When appropriate,
an  experiment  was  designed  to  test  that  part  of  the  program.  These 
experiments  were  described  in  the  quarterly  progress  reports. 

5.8  Project  Status 

Most  of  the  algorithm  development  for  the  phonetic  synthesis 
program  is  complete.  Some  of  the  time-warping  and  gain  adjustment 
rules  may  undergo  slight  tuning,  but  no  significant  changes  are 
anticipated.

The major time-consuming part of the project is involved with
accumulating  the  large  set  of  diphone  templates.  At  present,  the 
new  data  base  has  been  completely  recorded,  and  is  mostly 
digitized.  The  labeling  or  manual  transcription  of  the  diphone 
templates  is  half  completed. 

We  found,  after  transcribing  half  of  the  diphones,  that  some 
sort  of  auditory  feedback  to  the  transcriber  was  necessary,  in 
order  to  assure  that  the  time  boundaries  were  placed  correctly.  We 
therefore  augmented  the  diphone  template  compiler  and  the  phonetic 
synthesis  program  to  facilitate  testing  of  newly  transcribed 
diphone  templates.  We  expect  that  these  additions  will  ultimately 
improve  the  quality  of  the  synthesized  speech,  and  speed  up  the 
labeling  process. 

A Class of All-Zero Lattice Digital Filters:
Properties and Applications

JOHN MAKHOUL, SENIOR MEMBER, IEEE


Abstract-A class of minimum- or maximum-phase all-zero lattice
digital filters, based on the two-multiplier lattice of Itakura and Saito,
is developed. Different lattice forms with different numbers of
multipliers are derived, including two one-multiplier forms. Many of the
properties of these lattice filters are given, including the important
orthogonalization and decoupling properties of successive stages in
optimal inverse filtering of signals. These properties lead to important
applications in the areas of adaptive linear prediction and adaptive
Wiener filtering. As a specific example, the design of a new fast start-up
equalizer is presented.

This work was supported by the Information Processing Techniques
MDA903-75-C-0180.

The author is with Bolt Beranek and Newman, Inc., Cambridge.
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I.  InTHODLi  TION 

Several lattice and ladder structures have been proposed
for the implementation of all-pole and pole-zero digital
filters [1]-[5]. However, only a single lattice structure, due
to Itakura and Saito [1], is available for the implementation
of all-zero filters that are restricted to be minimum phase or
maximum phase. This lattice filter structure has been useful
in speech analysis applications [6] and promises to be useful
in other areas as well, wherever transversal, nonrecursive, or
finite impulse response (FIR) filters are used in an adaptive
manner.

The lattice of Itakura and Saito had two multipliers at each
stage. In this paper, one-, two-, three-, and four-multiplier
lattice structures will be developed for the implementation of
minimum and maximum phase all-zero filters. Of particular
interest is the one-multiplier form, because of the decreased
number of multiplications. These structures are presented in
Section IV. Before that, however, Section II presents the
basic lattice of Itakura and Saito and develops some of its
properties. More properties are derived in Section III where
the application of the lattice to linear prediction is presented.
Of importance are the orthogonalization and decoupling
properties of the successive stages in the lattice. These
properties are then used in Section V to show how the lattice can be
employed very profitably in the areas of adaptive linear
prediction and adaptive Wiener filtering. As a specific example,
the design of a new and efficient fast start-up equalizer is
presented.

One of the important properties of the lattice that is not
discussed in this paper is its low sensitivity to roundoff noise
[7], [8].


Fig. 1. The basic two-multiplier all-zero lattice of Itakura and Saito
[1], [6].

II. BASIC ALL-ZERO LATTICE

Fig. 1 shows the basic two-multiplier lattice of Itakura and
Saito [1], which they used for performing speech analysis.
From Fig. 1, the following relations hold:

$$f_0(n) = g_0(n) = x(n), \tag{1a}$$

$$f_m(n) = f_{m-1}(n) + K_m\, g_{m-1}(n-1), \tag{1b}$$

$$g_m(n) = K_m\, f_{m-1}(n) + g_{m-1}(n-1). \tag{1c}$$

$x(n)$ is the input signal, $f_m(n)$ is the "forward" residual at
stage $m$, and $g_m(n)$ is the "backward" residual at stage $m$.
In z-transform notation, (1) can be rewritten as

$$F_0(z) = G_0(z) = X(z), \tag{2a}$$

$$F_m(z) = F_{m-1}(z) + K_m z^{-1} G_{m-1}(z), \tag{2b}$$

$$G_m(z) = K_m F_{m-1}(z) + z^{-1} G_{m-1}(z). \tag{2c}$$

Let the forward and backward transfer functions at stage $m$
be defined by

$$A_m(z) = \frac{F_m(z)}{X(z)} = \frac{F_m(z)}{F_0(z)} \tag{3a}$$

and

$$B_m(z) = \frac{G_m(z)}{X(z)} = \frac{G_m(z)}{G_0(z)}. \tag{3b}$$

Then, from (2) and (3) it is simple to see that $A_m(z)$ and
$B_m(z)$ obey the recursion relations

$$A_0(z) = B_0(z) = 1, \tag{4a}$$

$$A_m(z) = A_{m-1}(z) + K_m z^{-1} B_{m-1}(z), \tag{4b}$$

$$B_m(z) = K_m A_{m-1}(z) + z^{-1} B_{m-1}(z). \tag{4c}$$

Furthermore, one can show from (4) that

$$B_m(z) = z^{-m} A_m(z^{-1}). \tag{5}$$

Thus, if $A_m(z)$ is given by

$$A_m(z) = \sum_{k=0}^{m} a_m(k)\, z^{-k}, \tag{6a}$$

where $a_m(k)$ are the polynomial coefficients for an $m$-stage
lattice, then

$$B_m(z) = \sum_{k=0}^{m} a_m(m-k)\, z^{-k}, \tag{6b}$$

and $B_m(z)$ is the reverse polynomial corresponding to $A_m(z)$.
From (4) and (6a) we also have

$$a_m(0) = 1, \qquad a_m(m) = K_m. \tag{7}$$
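As a numerical check of (1)-(6), the following Python sketch runs the two-multiplier lattice sample by sample and compares its outputs with direct FIR filtering by $A_p(z)$ and by its reverse polynomial $B_p(z)$. The function names, the test signal, and the reflection coefficients are illustrative choices and do not come from the paper.

```python
def step_up(ks):
    """Coefficients a_p(0..p) of A_p(z), built with recursion (4)."""
    a = [1.0]                                   # (4a): A_0(z) = 1
    for k in ks:
        fwd = a + [0.0]                         # A_{m-1}(z)
        bwd = [0.0] + a[::-1]                   # z^{-1} B_{m-1}(z), using (5)
        a = [u + k * v for u, v in zip(fwd, bwd)]   # (4b)
    return a


def fir(x, h):
    """Direct-form FIR filter with zero initial conditions."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def lattice(x, ks):
    """Two-multiplier lattice (1); returns f_p(n) and g_p(n)."""
    delayed = [0.0] * len(ks)                   # delayed[m] holds g_m(n-1)
    f_out, g_out = [], []
    for sample in x:
        f = g = sample                          # (1a)
        for m, k in enumerate(ks):
            g_prev, delayed[m] = delayed[m], g
            f, g = f + k * g_prev, k * f + g_prev   # (1b), (1c)
        f_out.append(f)
        g_out.append(g)
    return f_out, g_out


ks = [0.5, -0.3, 0.2]                           # hypothetical K_1..K_3
x = [1.0, 0.5, -0.3, 0.8, 0.2, -0.7, 0.4]       # arbitrary test input
a = step_up(ks)
f_p, g_p = lattice(x, ks)
err_f = max(abs(u - v) for u, v in zip(f_p, fir(x, a)))        # A_p(z) path
err_g = max(abs(u - v) for u, v in zip(g_p, fir(x, a[::-1])))  # B_p(z) path
```

Both error values should be at rounding level, which illustrates that the backward path realizes the reversed coefficient sequence of the forward path.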


Now, given some polynomial $A_p(z)$, with $a_p(0) = 1$, one can
generate all the polynomials $A_m(z)$, $m < p$, and the coefficients
$K_m$, using the following reverse recursion derived from (4):

$$K_m = a_m(m), \qquad A_{m-1}(z) = \frac{A_m(z) - K_m B_m(z)}{1 - K_m^2}, \tag{8}$$

along with (5), and beginning with $m = p$. It is clear from (8)
that should $|K_m| = 1$ for some $m = m'$, then the solution for
$A_{m'-1}(z)$ is indeterminate. Therefore, the reverse recursion
(8) is possible iff $|K_m| \neq 1$, for all $m$.
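The forward recursion (4) and the reverse recursion (8) are inverses of each other whenever $|K_m| \neq 1$. A minimal Python sketch of the round trip follows; the helper names and the example coefficients are assumptions made for illustration.

```python
def step_up(ks):
    """Build a_p(0..p) from K_1..K_p with recursion (4)."""
    a = [1.0]
    for k in ks:
        a = [u + k * v for u, v in zip(a + [0.0], [0.0] + a[::-1])]
    return a


def step_down(a):
    """Recover K_1..K_p from a_p(0..p) with the reverse recursion (8)."""
    a = list(a)
    ks = []
    while len(a) > 1:
        k = a[-1]                                # K_m = a_m(m)
        if abs(1.0 - k * k) < 1e-12:
            raise ValueError("|K_m| = 1: A_{m-1}(z) is indeterminate")
        b = a[::-1]                              # B_m(z), from (6b)
        # The last coefficient of A_m - K_m B_m is zero and is dropped.
        a = [(u - k * v) / (1.0 - k * k) for u, v in zip(a, b)][:-1]
        ks.append(k)
    return ks[::-1]


ks = [0.5, -0.3, 0.2]                            # hypothetical example
recovered = step_down(step_up(ks))
```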

It also follows from (5) and (6) that the zeros of $B_m(z)$
are the reciprocal of the zeros of $A_m(z)$. In particular, if all
the zeros of $A_m(z)$ fall inside the unit circle, in which case
$A_m(z)$ is minimum phase, then $B_m(z)$ is maximum phase.

One can show that the minimum phase condition for $A_m(z)$
is guaranteed iff

$$-1 < K_i < 1, \qquad 1 \le i \le m. \tag{9}$$

The coefficients $K_m$ are then known as reflection coefficients
or partial correlation coefficients. In much of the paper we
shall assume that (9) holds, and therefore $A_m(z)$ and $B_m(z)$
are minimum phase and maximum phase, respectively.
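Condition (9) gives a practical minimum-phase test that needs no root finding: run the reverse recursion (8) and stop as soon as some $|K_m| \ge 1$. The sketch below assumes a polynomial with $a(0) = 1$; the function name and both example polynomials are hypothetical.

```python
def is_minimum_phase(a):
    """True iff every K_m obtained from (8) satisfies (9); requires a[0] == 1."""
    a = list(a)
    while len(a) > 1:
        k = a[-1]                                # K_m = a_m(m)
        if abs(k) >= 1.0:
            return False                         # (9) is violated
        b = a[::-1]
        a = [(u - k * v) / (1.0 - k * k) for u, v in zip(a, b)][:-1]
    return True


# (1 - 0.5 z^-1)(1 + 0.25 z^-1): both zeros inside the unit circle.
inside = is_minimum_phase([1.0, -0.25, -0.125])
# (1 - 2 z^-1)(1 - 0.25 z^-1): one zero at z = 2, outside the unit circle.
outside = is_minimum_phase([1.0, -2.25, 0.5])
```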


A. Residual Energy

From (3a), the forward residual $f_m(n)$ can be represented by

$$F_m(z) = A_m(z)\, X(z). \tag{10}$$

By noting that the energy of $f_m(n)$ is equal to its zeroth
autocorrelation coefficient, one can show, using (10), that the
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n^o 


l-n  = 


0 


(27) 


(28) 


(20) 


Now, from (6b), we see that the rows of $C_p$ are the coefficients
of the backward filters $B_m(z)$. Therefore, using (3b),
one can write

$$\mathbf{g}_p = C_p\, \mathbf{x}_p \tag{30a}$$

where

$$\mathbf{x}_p = [\,x(n)\;\; x(n-1)\;\; \cdots\;\; x(n-p)\,]^T \tag{30b}$$

and

$$\mathbf{g}_p = [\,g_0(n)\;\; g_1(n)\;\; \cdots\;\; g_p(n)\,]^T. \tag{30c}$$

The autocorrelation matrix of the backward residuals is then given by

$$R_p^{g} = \overline{\mathbf{g}_p \mathbf{g}_p^T} = C_p\, \overline{\mathbf{x}_p \mathbf{x}_p^T}\, C_p^T = C_p R_p C_p^T \tag{31}$$

which, from (27), is equal to

$$R_p^{g} = E_p, \tag{32}$$

which is a diagonal matrix. Another way to write (32) is

$$\overline{g_i(n)\, g_j(n)} = \begin{cases} E_i, & i = j \\ 0, & i \neq j. \end{cases} \tag{33}$$

This means that the backward residuals in the lattice are
orthogonal to each other. Thus, the backward residuals are
the result of an orthogonalization process (of the Gram-Schmidt
type) on delayed versions of the signal $x(n)$. A different
derivation of (33) as well as other correlation properties
of the forward and backward signals are given in Appendix II.

The orthogonalization process results in a decoupling of the
successive stages from each other. One of Itakura's contributions
was to recognize this decoupling and use it to estimate
the reflection coefficients in a straightforward manner. In
fact, one can show that the global minimization of the output
residual energy can be accomplished as a sequence of local
minimization problems, one at each stage. Thus, one can obtain
the optimal value of $K_m$ that minimizes $E_m$ by setting
the derivative of $E_m$ in (16) with respect to $K_m$ to zero,
and noting that $E_{m-1}$ and $r_m$ are not functions of $K_m$.
The answer is

$$K_m^* = -r_m \tag{34}$$

where $r_m$ is the correlation coefficient in (15) between the
two inputs to the $m$th stage. Substituting (34) in (16), the
minimum residual energy at stage $m$ is computed recursively
from

$$E_m^* = (1 - K_m^{*2})\, E_{m-1}^*, \tag{35}$$

which gives the minimum residual energy at each stage in
terms of the minimum residual energy of the previous stage
and the reflection coefficient of the present stage. This is
yet another manifestation of the decoupling between stages.
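The stage-by-stage minimization, the energy recursion (35), and the orthogonality property (33) can be illustrated numerically. The Python sketch below assumes a windowed (zero-extended) finite signal and replaces the averages by sums over all time samples; under that assumption the forward and delayed backward energies are equal at each stage, so minimizing the forward energy alone gives $K_m^* = -r_m$. The function name and the test signal are hypothetical.

```python
def lattice_analysis(x, p):
    """Optimal K_m, residual energies, and backward residuals, stage by stage."""
    f = list(x) + [0.0] * p                  # f_0(n) = x(n), zero-extended
    g = list(f)                              # g_0(n) = x(n)
    ks, energies, backward = [], [sum(v * v for v in f)], [g]
    for _ in range(p):
        gd = [0.0] + g[:-1]                  # g_{m-1}(n-1)
        # Local minimization of the forward residual energy at this stage.
        k = -sum(u * v for u, v in zip(f, gd)) / sum(v * v for v in gd)
        f, g = ([u + k * v for u, v in zip(f, gd)],      # (1b)
                [k * u + v for u, v in zip(f, gd)])      # (1c)
        ks.append(k)
        energies.append(sum(v * v for v in f))
        backward.append(g)
    return ks, energies, backward


x = [1.0, 0.6, -0.2, 0.9, -0.5, 0.3, 0.1, -0.8]   # arbitrary test signal
ks, energies, backward = lattice_analysis(x, 3)
# Gram matrix of the backward residuals: expected diagonal, as in (32)-(33).
gram = [[sum(u * v for u, v in zip(gi, gj)) for gj in backward]
        for gi in backward]
```

In this setting the off-diagonal entries of `gram` vanish to rounding error and the diagonal entries match `energies`, which in turn obey (35).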

From (34) and (20) we conclude that $K_m^*$ obeys (9), and
therefore that the filter $A_p(z)$ that minimizes the output
residual energy is minimum phase.

The minimum normalized residual energy at each stage
$m$ is obtained from (16a) and (35):

$$V_m^* = \prod_{i=1}^{m} (1 - K_i^{*2}), \tag{36}$$

where $K_i^*$ is given by (34). From the discussion above, it is
clear that $V_m^*$ is bounded by

$$0 \le V_m^* \le 1. \tag{37}$$

Thus, while $V_m$ for an arbitrary filter can have values greater
than one, as in (21) and (22), using the filter that minimizes
the output residual energy results in a $V_m^*$ that is less than
one.

Because of the equality of the energies of the forward and
the backward residuals in (12), the correlation coefficients
$r_m$ in (15) can be computed in different ways. Another
definition of $r_m$, and hence $K_m^*$, is obtained by minimizing
the sum of the forward and backward residual energies
at each stage. The answer, due to Burg [10], is

$$K_m^* = -r_m = -\frac{2\,\overline{f_{m-1}(n)\, g_{m-1}(n-1)}}{\overline{f_{m-1}^2(n)} + \overline{g_{m-1}^2(n-1)}}. \tag{38}$$

Both definitions (34) and (38) guarantee condition (9), even
for a finite signal, and hence guarantee that $A_m(z)$ is minimum
phase. Other definitions for $K_m$ that guarantee (9)
can be found in [11]. In the same reference, the author
develops more efficient methods to compute (34) and (38).

The values of $r_m$, and therefore $K_m^*$, in (34) and (38) are
equal if and only if $\overline{f_{m-1}^2(n)} = \overline{g_{m-1}^2(n-1)}$, which happens
generally only if the signal is stationary or windowed. Different
values result if the signal is not windowed [11], but both
(34) and (38) continue to obey (9).
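The two estimates can be compared directly on a pair of finite sequences. The sketch below assumes that the correlation coefficient of (15) is the usual normalized cross-correlation (the cross term divided by the geometric mean of the two energies), while (38) divides by their arithmetic mean; averages are replaced by plain sums. The function names and data are hypothetical.

```python
import math


def k_normalized(f, gd):
    """K* = -r_m with r_m taken as a normalized cross-correlation (assumed (15))."""
    c = sum(u * v for u, v in zip(f, gd))
    return -c / math.sqrt(sum(u * u for u in f) * sum(v * v for v in gd))


def k_burg(f, gd):
    """Burg's estimate (38)."""
    c = sum(u * v for u, v in zip(f, gd))
    return -2.0 * c / (sum(u * u for u in f) + sum(v * v for v in gd))


f = [1.0, 2.0, 3.0]                 # stands in for f_{m-1}(n)
gd_equal = [3.0, 1.0, 2.0]          # g_{m-1}(n-1) with the same energy as f
gd_unequal = [0.5, -1.0, 4.0]       # g_{m-1}(n-1) with a different energy

same = (k_normalized(f, gd_equal), k_burg(f, gd_equal))
differ = (k_normalized(f, gd_unequal), k_burg(f, gd_unequal))
```

With equal energies the two values coincide; with unequal energies they differ, but both stay inside the open interval required by (9).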


IV. ALL-ZERO LATTICE REALIZATIONS

A. Alternate Forms

In this section we shall develop several new all-zero lattice
realizations in addition to Itakura's original realization. The
resulting forms are canonic in the number of delays. Some of
the realizations are actually in ladder form, but we shall use
the word lattice to include both lattice and ladder forms. For
each of the realizations only a single representative stage of
the transfer function will be shown. The actual signal at each
stage is obtained by multiplying the transfer function ($A_m(z)$
or $B_m(z)$) by the input $X(z)$.

Lattice Form 1 (LF1): Fig. 2 shows a stage of this lattice
form, which is the form used in Fig. 1, and is represented by
(4). The form has two multiplies and two adds. The addition
nodes are shown explicitly to differentiate them from branch
nodes.

Fig. 2. The mth stage of lattice form 1 (LF1).

Lattice Form 2 (LF2): Add and subtract $s_m K_m A_{m-1}(z)$
from (4b), and add and subtract $s_m K_m z^{-1} B_{m-1}(z)$ from (4c),
where $s_m$ is a sign parameter,

$$s_m = \pm 1, \quad \text{and} \quad s_m^2 = 1; \tag{39}$$

one obtains

$$A_m(z) = (1 - s_m K_m)\, A_{m-1}(z) + s_m K_m T_m(z) \tag{40}$$

$$B_m(z) = K_m T_m(z) + (1 - s_m K_m)\, z^{-1} B_{m-1}(z) \tag{41}$$

where

$$T_m(z) = A_{m-1}(z) + s_m z^{-1} B_{m-1}(z). \tag{42}$$

These equations represent form LF2, which can be shown to
have three multiplies (neglecting the multiplies by $\pm 1$) and three
adds. The choice of the sign of $s_m$ for each stage is not
important here. However, we shall see that the choice will be
important in the one-multiplier realizations in Section IV-B.

Lattice Form 3 (LF3): LF3 is obtained similarly to LF2, but
by adding and subtracting $s_m z^{-1} B_{m-1}(z)$ from (4b), and
adding and subtracting $s_m A_{m-1}(z)$ from (4c). The result is

$$A_m(z) = T_m(z) + (K_m - s_m)\, z^{-1} B_{m-1}(z) \tag{43}$$

$$B_m(z) = (K_m - s_m)\, A_{m-1}(z) + s_m T_m(z) \tag{44}$$

where $T_m(z)$ is given by (42). LF3 uses two multiplies and
three adds.
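Since LF2 and LF3 are algebraic rearrangements of (4b) and (4c), one stage of each should produce exactly the same polynomials as LF1 for either sign. The Python sketch below checks this on coefficient lists (index $k$ holds the coefficient of $z^{-k}$); the helper names and the numerical values are illustrative assumptions.

```python
def combine(c1, p1, c2, p2):
    """Return the polynomial c1*p1 + c2*p2 (coefficient lists of equal length)."""
    return [c1 * u + c2 * v for u, v in zip(p1, p2)]


def operands(a, b):
    """A_{m-1}(z) padded to degree m, and z^{-1} B_{m-1}(z)."""
    return a + [0.0], [0.0] + b


def stage_lf1(a, b, k):
    A, zB = operands(a, b)
    return combine(1.0, A, k, zB), combine(k, A, 1.0, zB)            # (4b), (4c)


def stage_lf2(a, b, k, s):
    A, zB = operands(a, b)
    T = combine(1.0, A, s, zB)                                       # (42)
    return (combine(1 - s * k, A, s * k, T),                         # A_m, LF2
            combine(k, T, 1 - s * k, zB))                            # B_m, LF2


def stage_lf3(a, b, k, s):
    A, zB = operands(a, b)
    T = combine(1.0, A, s, zB)                                       # (42)
    return combine(1.0, T, k - s, zB), combine(k - s, A, s, T)       # A_m, B_m, LF3


a1, b1 = [1.0, 0.35], [0.35, 1.0]     # hypothetical A_1(z) and its reverse B_1(z)
k2 = -0.3                             # hypothetical K_2
ref = stage_lf1(a1, b1, k2)
results = {s: (stage_lf2(a1, b1, k2, s), stage_lf3(a1, b1, k2, s))
           for s in (1, -1)}
```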

By substituting for $A_{m-1}(z)$ from (4b) in (4c) one obtains
a different form with three multiplies and two adds. A similar
form is also obtained by substituting for $z^{-1} B_{m-1}(z)$ from
(4c) in (4b). The details are left to the reader.

Although the additional lattice forms derived above are
interesting variations on the basic lattice, they do not seem to
offer any particular advantages over LF1. In fact, they all
require more computations, either in the number of
multiplications or the number of additions or both. However, by
appropriate manipulation it is possible to transform LF2 and
LF3 to forms with only a single multiplier. This is described
below.

B. One-Multiplier Forms

The lattice transfer function may be scaled by multiplying
each stage by some multiplier $M_m$. The scaled
transfer functions are then given by

$$\hat{A}_p(z) = P_p A_p(z), \qquad \hat{B}_p(z) = P_p B_p(z) \tag{45}$$

where

$$P_p = \prod_{m=1}^{p} M_m. \tag{46}$$

Equations (40)-(42) for LF2 are then transformed to

$$\hat{A}_m(z) = M_m\left[(1 - s_m K_m)\, \hat{A}_{m-1}(z) + s_m K_m \hat{T}_m(z)\right]$$

$$\hat{B}_m(z) = M_m\left[K_m \hat{T}_m(z) + (1 - s_m K_m)\, z^{-1} \hat{B}_{m-1}(z)\right] \tag{47}$$

where

$$\hat{T}_m(z) = \hat{A}_{m-1}(z) + s_m z^{-1} \hat{B}_{m-1}(z). \tag{48}$$

By setting

$$M_m = \frac{1}{1 - s_m K_m} \tag{49}$$

(47) becomes

$$\hat{A}_m(z) = \hat{A}_{m-1}(z) + \frac{s_m K_m}{1 - s_m K_m}\, \hat{T}_m(z)$$

$$\hat{B}_m(z) = \frac{K_m}{1 - s_m K_m}\, \hat{T}_m(z) + z^{-1} \hat{B}_{m-1}(z). \tag{50}$$

The flow diagrams for (50) are shown in Fig. 3(a) for $s_m = +1$,
and in Fig. 3(b) for $s_m = -1$. This form, which we shall label
LF2/1, has only a single multiply but the three adds remain.
Thus the total number of operations is equal to that of LF1,
except that one multiply has been replaced by one add. Where
multiplication is expensive compared with addition (e.g., in hardware
implementations), LF2/1 can result in substantial savings.
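A one-multiplier LF2 stage can be sketched as follows: with the stage gain chosen as $1/(1 - s_m K_m)$, both outputs share the single product of $K_m/(1 - s_m K_m)$ with the sum of the forward input and $s_m$ times the delayed backward input, and the remaining factor $s_m$ is only a sign change. The Python below checks that the stage output equals the LF1 output scaled by that gain. Function names and numbers are hypothetical.

```python
def stage_lf1(a, b, k):
    """Reference two-multiplier stage, (4b) and (4c)."""
    A, zB = a + [0.0], [0.0] + b
    return ([u + k * v for u, v in zip(A, zB)],
            [k * u + v for u, v in zip(A, zB)])


def stage_one_multiplier(a, b, k, s):
    """Scaled LF2 stage using one multiplier value c = K / (1 - s K)."""
    A, zB = a + [0.0], [0.0] + b
    T = [u + s * v for u, v in zip(A, zB)]       # sum node (sign change only)
    c = k / (1.0 - s * k)                        # the single multiplier
    prod = [c * t for t in T]                    # the one multiply per stage
    a_out = [u + s * w for u, w in zip(A, prod)]     # add (sign change only)
    b_out = [w + v for w, v in zip(prod, zB)]        # add
    return a_out, b_out


a1, b1 = [1.0, 0.35], [0.35, 1.0]    # hypothetical inputs to the stage
k2 = -0.3                            # hypothetical K_2
checks = {}
for s in (1, -1):
    gain = 1.0 / (1.0 - s * k2)      # stage gain M_m for this sign
    ref_a, ref_b = stage_lf1(a1, b1, k2)
    checks[s] = (stage_one_multiplier(a1, b1, k2, s),
                 ([gain * v for v in ref_a], [gain * v for v in ref_b]))
```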

LF3 can be transformed similarly using the multiplier

$$M_m = \frac{1}{K_m - s_m} \tag{51}$$

with (43) and (44). The result is

$$\hat{A}_m(z) = \frac{1}{K_m - s_m}\, \hat{T}_m(z) + z^{-1} \hat{B}_{m-1}(z)$$

$$\hat{B}_m(z) = \hat{A}_{m-1}(z) + \frac{s_m}{K_m - s_m}\, \hat{T}_m(z). \tag{52}$$

The flow graphs are shown in Fig. 4 for $s_m = \pm 1$, and we shall
call this form LF3/1. It also has one multiply and three adds.

Residual Energy: The residual energy at stage $m$ for the
transformed lattice is given from (11) and (45) by

$$\hat{E}_m = P_m^2 \sum_{i=-m}^{m} b_m(i)\, R(i), \tag{53}$$

and from (18) and (46) by

$$\hat{E}_m = R(0) \prod_{i=1}^{m} M_i^2\, (1 + 2 r_i K_i + K_i^2). \tag{54}$$

In the context of linear prediction analysis, the minimum
residual energy for the transformed lattice is given from (36)
and (46) by

$$\hat{E}_m^* = R(0) \prod_{i=1}^{m} M_i^2\, (1 - K_i^{*2}). \tag{55}$$

For LF2/1, we have from (49) and (55)

Fig. 3. One-multiplier lattice form LF2/1. (a) $s_m = +1$. (b) $s_m = -1$.

Fig. 4. One-multiplier lattice form LF3/1. (a) $s_m = +1$. (b) $s_m = -1$.

One can also show that the minimum residual energy for
LF3/1 is given by (56) as well, since $(1 - s_m K_m)^2 = (K_m - s_m)^2$.
The minimum normalized residual energy $\hat{V}_m^*$ in each
case can be obtained simply by dividing (55) or (56) by
$R(0)$.

Choice of Sign Parameters: One can choose the sign parameters
to suit any particular application. In finite wordlength
(FWL) computations, one is often interested in minimizing
the scaling operations to make maximum use of the number
of bits available. One reasonable criterion to meet [4] is to
have the energy at each point in the lattice be as constant as
possible, and in particular not to exceed some specified overflow
value, say $T$. At each stage compute $\hat{E}_m$ from

$$\hat{E}_m = \frac{1 + 2 r_m K_m + K_m^2}{(1 - s_m K_m)^2}\, \hat{E}_{m-1} \tag{57}$$

for $s_m = +1$ and $s_m = -1$, and choose the value of $s_m$ that
results in $\hat{E}_m$ being closer to but not exceeding the threshold
$T$. For an optimal inverse filter (one that minimizes the
residual energy), the residual energy is computed from

$$\hat{E}_m^* = \frac{1 + s_m K_m^*}{1 - s_m K_m^*}\, \hat{E}_{m-1}^*. \tag{58}$$

Since, in this case, $|K_m^*| < 1$, one can increase or decrease
$\hat{E}_m^*$ relative to $\hat{E}_{m-1}^*$ by proper choice of $s_m$. Thus, for
$s_m = \operatorname{sgn} K_m^*$,

$$\frac{\hat{E}_m^*}{\hat{E}_{m-1}^*} = \frac{1 + |K_m^*|}{1 - |K_m^*|} > 1, \tag{59}$$

and for $s_m = -\operatorname{sgn} K_m^*$,

$$\frac{\hat{E}_m^*}{\hat{E}_{m-1}^*} = \frac{1 - |K_m^*|}{1 + |K_m^*|} < 1. \tag{60}$$
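The sign-selection rule can be sketched in a few lines. For the optimal inverse filter with stage gain $1/(1 - s_m K_m)$, the per-stage energy ratio is $(1 - K_m^2)/(1 - s_m K_m)^2 = (1 + s_m K_m)/(1 - s_m K_m)$, so each stage offers one choice that raises the energy and one that lowers it. The Python below picks, at each stage, the sign whose energy is closest to but not above a threshold; the fallback used when both choices exceed the threshold is an assumption of this sketch, as are the function name and the example values.

```python
def choose_signs(ks, e0, threshold):
    """Pick s_m = +1 or -1 per stage to keep the energy near the threshold."""
    signs, energies = [], [e0]
    for k in ks:
        e_prev = energies[-1]
        # Candidate energies for both signs, using the per-stage ratio.
        candidates = [(e_prev * (1 + s * k) / (1 - s * k), s) for s in (1, -1)]
        allowed = [c for c in candidates if c[0] <= threshold]
        # Assumption: if both choices overflow, take the smaller energy.
        energy, s = max(allowed) if allowed else min(candidates)
        signs.append(s)
        energies.append(energy)
    return signs, energies


# Hypothetical example: two stages with K* = 0.5, unit input energy, T = 2.
signs, energies = choose_signs([0.5, 0.5], 1.0, 2.0)
```

In this example the first stage must lower the energy (raising it would give 3, above the threshold), and the second stage can then raise it back to about 1.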


C. Normalized Forms

By specifying appropriate multipliers $M_m$ at each stage, one
can ensure that the normalized energy at each stage is unity
[7]. The resulting forms shall be called normalized forms.
Below, we shall assume that the filter is the optimal inverse
filter and, hence, that the minimum residual energy is given
by (55). The normalized minimum energy is then

$$\hat{V}_m = \prod_{i=1}^{m} M_i^2\, (1 - K_i^2), \tag{61}$$

where we have dropped the asterisks for convenience. It is
clear from (61) that $\hat{V}_m$ can be made to equal unity by
setting

$$M_m = \frac{1}{\sqrt{1 - K_m^2}} \tag{62}$$

at each stage. Introducing (62) in lattice forms 1-3 results
in the corresponding normalized forms. Below, we shall give
the results for lattice forms 1 and 2 only; the others can be
derived in a similar fashion.

Normalized Lattice Form 1 (NLF1): Multiplying (4) by $M_m$
in (62), one obtains

$$\hat{A}_m(z) = \frac{1}{\sqrt{1 - K_m^2}}\, \hat{A}_{m-1}(z) + \frac{K_m}{\sqrt{1 - K_m^2}}\, z^{-1} \hat{B}_{m-1}(z)$$

$$\hat{B}_m(z) = \frac{K_m}{\sqrt{1 - K_m^2}}\, \hat{A}_{m-1}(z) + \frac{1}{\sqrt{1 - K_m^2}}\, z^{-1} \hat{B}_{m-1}(z). \tag{63}$$

By setting

$$K_m = \cos w_m \tag{64}$$

where $w_m$ is an angle between $0$ and $\pi$, (63) reduces to

$$\hat{A}_m(z) = \csc w_m\, \hat{A}_{m-1}(z) + \operatorname{ctn} w_m\, z^{-1} \hat{B}_{m-1}(z)$$

$$\hat{B}_m(z) = \operatorname{ctn} w_m\, \hat{A}_{m-1}(z) + \csc w_m\, z^{-1} \hat{B}_{m-1}(z). \tag{65}$$

The resulting normalized form is shown in Fig. 5.

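Normalized Lattice Form 2 (NLF2): Substituting for $M_m$
from (62) into (47), there results

$$\hat{A}_m(z) = \frac{1 - s_m K_m}{\sqrt{1 - K_m^2}}\, \hat{A}_{m-1}(z) + \frac{s_m K_m}{\sqrt{1 - K_m^2}}\, \hat{T}_m(z)$$

$$\hat{B}_m(z) = \frac{K_m}{\sqrt{1 - K_m^2}}\, \hat{T}_m(z) + \frac{1 - s_m K_m}{\sqrt{1 - K_m^2}}\, z^{-1} \hat{B}_{m-1}(z) \tag{66}$$

which, by substituting (64), can be shown to reduce to

$$\hat{A}_m(z) = \left(\tan \frac{w_m}{2}\right)^{s_m} \hat{A}_{m-1}(z) + s_m \operatorname{ctn} w_m\, \hat{T}_m(z)$$

$$\hat{B}_m(z) = \operatorname{ctn} w_m\, \hat{T}_m(z) + \left(\tan \frac{w_m}{2}\right)^{s_m} z^{-1} \hat{B}_{m-1}(z). \tag{67}$$

The flow graphs for $s_m = +1$ and $-1$ are shown in Fig. 6.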

Finally, we note that in the case where the lattice is the
optimal inverse filter, the backward residuals in the normalized
structures are orthonormal to within a multiplicative constant:

$$\overline{\hat{g}_i(n)\, \hat{g}_j(n)} = \begin{cases} R(0), & i = j \\ 0, & i \neq j. \end{cases} \tag{68}$$


V. APPLICATIONS

In this section we present two applications of all-zero
lattice digital filters: 1) adaptive linear prediction; and 2)
adaptive Wiener filtering in the form of an efficient fast start-up
equalizer. In both applications one can use any of the
structures given in this paper, but the one-multiplier structures are
of special interest because of the reduced number of multiplies.

Fig. 5. Normalized lattice form NLF1.

Fig. 6. Normalized lattice form NLF2. (a) $s_m = +1$. (b) $s_m = -1$.

A. Adaptive Linear Prediction

Adaptive linear prediction or adaptive inverse filtering has
been useful in many applications such as speech, radar and
sonar processing, and adaptive noise canceling (see, for
example, [12]). In those applications it has been customary to use
a transversal filter to perform the filtering. However, with
some increase in computation one can use the all-zero lattice
to give more reliable, stable, and less noisy estimates of the
filter coefficients. Below, we describe briefly one method for
adaptively computing the reflection coefficients; other
methods are given elsewhere [11], [13], [14].

Given $K_m(n)$, $1 \le m \le p$, at time $n$, and the forward and
backward signals up to time $n$, we shall compute $K_m(n+1)$
using the following realization of (38):

$$K_m(n+1) = -\frac{2 \sum_{k=n_0}^{n} \beta^{\,n-k} f_{m-1}(k)\, g_{m-1}(k-1)}{\sum_{k=n_0}^{n} \beta^{\,n-k} \left[f_{m-1}^2(k) + g_{m-1}^2(k-1)\right]} \tag{69}$$

$$\phantom{K_m(n+1)} = -\frac{C(n)}{D(n)} \tag{70}$$

where

$$0 < \beta \le 1. \tag{71}$$

The lower limit $n_0$ in the summations in (69) depends on the
type of estimator memory to be used. We differentiate two
types of memory:

1) Growing memory ($n_0$ is a fixed integer, such as zero);
2) Fixed memory ($n_0 = n - M + 1$, where $M$ is the fixed
memory size).

Also, $\beta = 1$ represents a nonfading memory, and $\beta < 1$ a fading
memory. It is common to use either a fixed-nonfading memory
or a growing-fading memory.

Note that with the definition in (69), $|K_m(n)| < 1$ for all
$m$ and $n$. Other definitions which contain the energy of
either the forward or the backward residual but not both,
require fewer computations but do not guarantee (9) to
hold [11]. However, for some adaptive applications, (9)
need not hold [15].


Many adaptive applications use a growing-fading memory,
i.e., $n_0$ is fixed and $\beta < 1$. For this case, we have from (69)
and (70) the following recursive relations:

$$C(n) = \beta\, C(n-1) + 2 f_{m-1}(n)\, g_{m-1}(n-1) \tag{72a}$$

$$D(n) = \beta\, D(n-1) + f_{m-1}^2(n) + g_{m-1}^2(n-1). \tag{72b}$$

The adaptive procedure, then, operates as follows. Given the
forward and backward signals at time $n$, use (72) and (70) to
compute $K_m(n+1)$ for $1 \le m \le p$. Using the new values of
the reflection coefficients and any of the lattice structures
given in Section IV, compute the forward and backward
signals at time $n+1$. The process is then repeated.
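The adaptive procedure can be sketched in Python as follows, using the two-multiplier lattice (1) with the recursions for $C(n)$ and $D(n)$ kept separately for every stage. The small positive initial value of $D$ (to avoid a division by zero before any signal arrives), the zero initial reflection coefficients, the function name, and the test signal are assumptions of this sketch and are not specified in the text.

```python
def adaptive_lattice(x, p, beta=0.98, d0=1e-9):
    """Track K_1..K_p with a growing memory; returns the K values after each sample."""
    ks = [0.0] * p                    # K_m(n), assumed to start at zero
    C = [0.0] * p
    D = [d0] * p                      # assumed small positive start value
    delayed = [0.0] * p               # delayed[m] holds g_m(n-1)
    history = []
    for sample in x:
        f = g = sample                                   # (1a)
        for m in range(p):
            gd, delayed[m] = delayed[m], g
            C[m] = beta * C[m] + 2.0 * f * gd            # numerator recursion
            D[m] = beta * D[m] + f * f + gd * gd         # denominator recursion
            f, g = f + ks[m] * gd, ks[m] * f + gd        # lattice with K_m(n)
            ks[m] = -C[m] / D[m]                         # K_m(n+1)
        history.append(list(ks))
    return history


# Hypothetical test: x(n) = 0.9^n is the impulse response of 1/(1 - 0.9 z^-1),
# so the first reflection coefficient should approach -0.9 (nonfading memory).
history = adaptive_lattice([0.9 ** n for n in range(200)], p=2, beta=1.0)
final_k1 = history[-1][0]
```

Because the numerator term is bounded in magnitude by the denominator term at every update, every estimate in `history` stays strictly inside the unit interval.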

The fading or attenuation factor $\beta$ determines the "effective"
memory size of the estimator. When $\beta$ is close to 1
the adaption is slow, and is fast for small values of $\beta$. In
fact, it is clear from (72) that the effect of $\beta$ is that of a
single-pole low-pass filter operating on $C(n)$ and $D(n)$: $\beta$ is
the value of the pole in the z-plane.

Equations (69)-(72) can be used with any of the lattice
structures in this paper, including the one-multiplier structures.
In the latter case, one may compute the lattice multiplier
value directly from $C(n)$ and $D(n)$, without first computing
$K_m$. For example, the multiplier of Fig. 3(a) can be
computed as follows:

$$\frac{K_m}{1 - K_m} = -\frac{C(n)}{D(n) + C(n)}.$$

For the special case of Itakura's two-multiplier structure in
Fig. 1, one can show by simple manipulation that (70) can be
rewritten in the form

$$K_m(n+1) = K_m(n) - \frac{f_m(n)\, g_{m-1}(n-1) + g_m(n)\, f_{m-1}(n)}{D(n)} \tag{73}$$

where $D(n)$ is given recursively by (72b). Equation (73) shows
that each coefficient at time $n+1$ is obtained by adding a
correction term to the value at time $n$. One can see, for
example, that for a growing-nonfading memory ($\beta = 1$),
$D(n)$ increases continuously and, therefore, the correction
term in (73) tends to zero as $n$ goes to infinity. In this case,
$K_m$ tends to its optimal value with probability 1, assuming
a stationary signal. It is interesting to note that (73)
represents essentially a steepest descent gradient algorithm for
estimating $K_m$. For $\beta < 1$, and appropriate normalization, one
can show that (73) is similar to the lattice estimate of Griffiths
[16], and becomes equal to it in the steady state with a
gradient step size equal to $1 - \beta$. What we have shown here
essentially is that, due to the equivalence of (73) to (70), the
estimate in (73) is always guaranteed to obey (9).
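The equivalence between the ratio form (70) and the correction-term form can be checked numerically for a single stage. In the sketch below the stage inputs are arbitrary sequences, the stage outputs are formed with the current coefficient as in (1b) and (1c), and the correction term is the sum of the two cross products (stage output times the opposite stage input) divided by $D(n)$. The initial value of $D$, the function name, and the data are assumptions.

```python
def compare_updates(f_in, gd_in, beta=0.95, d0=1e-6):
    """Run the ratio update and the correction-term update side by side."""
    C, D = 0.0, d0
    k_ratio = 0.0                     # coefficient from -C(n)/D(n)
    k_corr = 0.0                      # coefficient from the correction form
    pairs = []
    for f, gd in zip(f_in, gd_in):    # f_{m-1}(n) and g_{m-1}(n-1)
        C = beta * C + 2.0 * f * gd
        D = beta * D + f * f + gd * gd
        f_m = f + k_corr * gd         # (1b) with the current coefficient
        g_m = k_corr * f + gd         # (1c) with the current coefficient
        k_corr = k_corr - (f_m * gd + g_m * f) / D
        k_ratio = -C / D
        pairs.append((k_ratio, k_corr))
    return pairs


f_in = [1.0, -0.5, 0.3, 0.8, -0.2]    # arbitrary stage inputs
gd_in = [0.0, 0.7, -0.4, 0.1, 0.6]
pairs = compare_updates(f_in, gd_in)
max_gap = max(abs(a - b) for a, b in pairs)
```

The two coefficient tracks agree to rounding error, so the correction form inherits the bound on the magnitude of the estimate that the ratio form provides.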

As a result of the orthogonalization and decoupling properties
of the lattice, the convergence of the adaptive lattice is
much faster than that for the corresponding adaptive transversal
filter. Griffiths [16] has noted that the convergence
time of the lattice is almost independent of the eigenvalue
spread of the signal, i.e., independent of its spectral dynamic
range.

B. A Fast Start-Up Equalizer

As an application of lattice filters to adaptive Wiener filtering,
we present the design of a new fast start-up equalizer.

During a start-up period prior to data transmission, the tap
coefficients of an adaptive equalizer are adjusted automatically
to optimum values that minimize the distortion or mean-square
error of the received pulses. It is desirable that the
start-up time used for this initial adaption process be as small
as possible. Chang [17] proposed an equalizer structure that
reduces the start-up time drastically. The general form of the
structure is shown in Fig. 7. The tap coefficients $c_i$ are
adjusted such that the mean-square error between $y(n)$ and some
reference signal is minimized. If the filters are selected such
that the signals $z_i(n)$ are orthonormal, then the tap coefficients
$c_i$ can be adjusted to their optimum values in one step [17].

The specific fast start-up equalizer proposed by Chang is
shown in Fig. 8. The filter signals $z_i(n)$ are obtained from
$x(n)$ by a linear transformation

$$\mathbf{z} = P\, \mathbf{x} \tag{74}$$

where

$$\mathbf{z} = [\,z_0(n)\;\; z_1(n)\;\; \cdots\;\; z_{N-1}(n)\,]^T, \tag{75}$$

$\mathbf{x}$ is given by (30b) with $p$ replaced by $N - 1$, and $P$ is an
$N \times N$ transformation matrix to be determined.

To enable fast start-up, the matrix $P$ is chosen such that
[17]

$$P R P^T = I, \tag{76}$$

where $R = \overline{\mathbf{x}\mathbf{x}^T}$ is the autocorrelation matrix (order $N - 1$) of
the signal $x(n)$ [see (26)], and $I$ is the unit diagonal matrix.
The signal $x(n)$ is taken here to be the response of the channel
to an impulse or a pseudorandom sequence. The solution
chosen for $P$ by Chang is

$$P = D^{-1/2} Q^T \tag{77}$$

where

$$R = Q D Q^T, \tag{78}$$

$Q$ is a matrix whose columns are the orthonormal eigenvectors
of $R$, and $D$ is a diagonal matrix whose elements are the
eigenvalues of $R$. From (77) and (78), and noting that $Q^T = Q^{-1}$,
it is clear that (76) holds and that the signals $z_i(n)$
defined by (74) and (75) are orthonormal.

The structure in Fig. 8 requires $N^2$ coefficients with an
equal number of multiplies. This number can grow rapidly
as the number of stages $N$ in the equalizer is increased. Using
special properties of the matrix $R$, Cantoni and Butler [18]
showed that the same equalizer structure can be implemented
using $N^2/2 + N$ coefficients, a saving of about one-half.
However, the number of coefficients still increases as the square of
the number of stages. Below, we shall show that, using a
lattice structure for the fast start-up part of the equalizer, the
number of coefficients becomes a linear function of $N$.

Let

    P = E^{-1/2} C                                       (79)

Fig. 7. Generalized equalizer structure.

Fig. 8. Fast start-up equalizer structure of Chang [17].


where C and E are given by (25) and (24), respectively. By
using (27), it is simple to show that P in (79) obeys (76),
as desired. By applying P in (79) to the input x, it is clear
from (74) and (29) that z can be obtained by taking the
orthogonal signals g_m(n) along the backward path of the
lattice and multiplying them by E_m^{-1/2} to normalize them.

To ensure fast start-up, one can show that (76) need hold
only to within a multiplicative constant. Thus, instead of
choosing P as in (79) we could choose P to equal

    P = V^{-1/2} C,                                      (80)

where the elements of V are the normalized residual energies:

    V_m = E_m / R(0).                                    (81)

We then have from (80), (81), and (27):

    P R P^T = R(0) I.                                    (82)

Therefore, the backward signals g_m(n) are multiplied by
V_m^{-1/2} to normalize them to within a constant equal to the
signal energy. This constant can then easily be incorporated into
the adjustment of the coefficients c_i of the equalizer proper.
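The relations (27), (80), and (82) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the standard Levinson-Durbin recursion to obtain the predictor coefficients a_m(k) and energies E_m, an assumed autocorrelation sequence R(k) = 2(0.9)^k, and helper names (levinson, backward_matrix, matmul, transpose) that are ours rather than the paper's. Row m of C holds the coefficients that form the backward residual g_m(n) from x(n), ..., x(n - m).

```python
def levinson(r):
    """Levinson-Durbin recursion on autocorrelations r[0..p].

    Returns (rows, E): rows[m] is the predictor a_m(0..m) with a_m(0) = 1,
    and E[m] is the minimum residual energy at stage m.
    """
    a, E = [1.0], [r[0]]
    rows = [a[:]]
    for m in range(1, len(r)):
        acc = sum(a[k] * r[m - k] for k in range(m))
        K = -acc / E[-1]                    # reflection coefficient K_m
        a_ext = a + [0.0]
        a = [a_ext[k] + K * a_ext[m - k] for k in range(m + 1)]
        E.append(E[-1] * (1.0 - K * K))
        rows.append(a[:])
    return rows, E

def backward_matrix(rows):
    """Lower-triangular C with C[m][j] = a_m(m - j), so that g = C x."""
    n = len(rows)
    return [[rows[m][m - j] if j <= m else 0.0 for j in range(n)]
            for m in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Assumed autocorrelation sequence for illustration: R(k) = 2 * 0.9**k.
r = [2.0 * 0.9 ** k for k in range(4)]
n = len(r)
R = [[r[abs(i - j)] for j in range(n)] for i in range(n)]

rows, E = levinson(r)
C = backward_matrix(rows)
CRCt = matmul(matmul(C, R), transpose(C))       # diagonal with entries E_m, as in (27)

# P = V^{-1/2} C with V_m = E_m / R(0), as in (80) and (81).
P = [[c / (E[m] / r[0]) ** 0.5 for c in C[m]] for m in range(n)]
PRPt = matmul(matmul(P, R), transpose(P))       # R(0) times the identity, as in (82)
```

In this sketch the triangular matrix C is formed explicitly only for checking; the point of the lattice is that the same signals are produced stage by stage without storing C.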

Fig. 9(a) shows the general lattice structure for the proposed
equalizer, and Fig. 9(b) shows the structure when the lattice
uses one of the normalized forms in Figs. 5 and 6. The values
of V_m to use must correspond to the particular lattice form
chosen. For example, if one uses the two-multiplier form
shown in Figs. 1 and 2, then V_m is given by (26), or if the
one-multiplier forms shown in Figs. 3 and 4 are employed,
then V_m as derived from (56) is used. On the other hand,
when the normalized forms are used, V_m is unity, as shown
in Fig. 9(b).

The total number of coefficients employed with either the
two-multiplier forms or the three-multiplier normalized forms
in Fig. 6 is 3(N - 1). The minimum number of coefficients
is achieved by the one-multiplier forms, and is equal to
2(N - 1). Therefore, the number of coefficients in the equalizer
is now linear with the number of stages, which makes
equalizers with a large number of stages much cheaper to
implement. Furthermore, from the earlier discussion it
becomes clear that the lattice equalizer should adapt to new
channels in a simple and efficient manner.
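To make the difference in growth concrete, the coefficient counts quoted above can be tabulated for a few values of N. This is a simple arithmetic sketch; the function name and dictionary keys are ours, and the Cantoni-Butler count is taken as N^2/2 + N (evaluated with integer division, so it is exact for even N).

```python
def coefficient_counts(n_stages):
    """Coefficient counts quoted in the text for an N-stage equalizer."""
    n = n_stages
    return {
        "chang": n * n,                              # N^2
        "cantoni_butler": n * n // 2 + n,            # N^2/2 + N
        "lattice_two_or_three_multiplier": 3 * (n - 1),
        "lattice_one_multiplier": 2 * (n - 1),
    }

for n in (8, 32, 128):
    print(n, coefficient_counts(n))
```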


IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-26, NO. 4, AUGUST 1978


Fig. 9. (a) General lattice fast start-up equalizer structure. (b) Lattice
fast start-up equalizer using normalized lattice forms.


C. Discussion

From (76), one can write

    R^{-1} = P^T P.                                      (83)

Therefore, the problem of choosing a transformation P that
renders the signals z_i(n) orthonormal can be seen from (83)
to be a problem in the factorization of the inverse of a
covariance matrix. Assuming R to be positive definite, which is
generally true for physical signals, there exists an infinity
of possible solutions for P. Two important factorizations were
given above. The first was an eigenvector factorization, where
the orthogonal decomposition of x into z is often termed a
Karhunen-Loève decomposition. The second factorization is
of the LDL^T type, where L is lower triangular with 1's along
the main diagonal. This factorization can be shown to be
equivalent to a Gram-Schmidt orthogonalization of x into z.
The triangular decomposition results in a matrix P with
N(N + 1)/2 elements, as compared to N^2 in the eigenvector
decomposition, and hence offers a reduction of about one-half
in the number of coefficients. In addition, it offers substantial
savings in computation due to the fact that triangular
factorization is much simpler and less expensive than computing
eigenvalues and eigenvectors. However, the major benefit
accrues when R is a symmetric Toeplitz matrix. For this
special but important case, the triangular decomposition can
be implemented in a lattice form, with the number of
coefficients being linear with N. What the lattice effectively does
is perform the Gram-Schmidt orthogonalization recursively:
each stage does its best in decorrelating the two inputs
entering it. There are cases where R is not Toeplitz, but where a
good suboptimal lattice solution may still be found (see [11]).

Finally, we point out that the structures in Fig. 9 can be
used in place of any adaptive transversal Wiener filter in order
to speed up the convergence of parameter estimation.

VI. CONCLUSION

Based on the two-multiplier all-zero lattice of Itakura and
Saito, a number of other lattice forms were derived. Of those
forms, perhaps the one-multiplier is of greatest interest,
because it is canonical in the number of multiplies.

Properties of the all-zero lattice were given in some detail.
The specific simultaneous implementation of the minimum-
phase and maximum-phase filters in the lattice leads to
interesting properties, including the important orthogonalization
and decoupling properties. These properties make the lattice
especially attractive in adaptive linear prediction, or in other
areas of Wiener filtering where transversal or FIR filters are
used in an adaptive manner. The given adaptive algorithms
for the estimation of the reflection coefficients are seen to be
superior to similar algorithms using a transversal filter
because of the inherent stability of the lattice structure, the
fast convergence of lattice algorithms, and the relative
independence of the convergence with respect to the signal
eigenvalue spread.

We also showed how the lattice can be used in the efficient
design of a fast start-up equalizer. The number of coefficients
used can be as few as 2(N - 1), where N is the number of
equalizer taps, compared to a number that grows as N^2 when
using the equalizer structures of Chang, and Cantoni and Butler.

APPENDIX I

We wish to show that (27) holds, i.e.,

    C R C^T = E                                          (27)

where we have dropped the subscripts and superscripts for
clarity. Because R is symmetric about both major diagonals,
one can rewrite (26a) as

    R [a_p(p)  ...  a_p(1)  1]^T = [0  ...  0  E_p]^T.   (A-1)

This equation applies for any value of p, and therefore

    R C^T = R \begin{bmatrix}
              1 & a_1(1) & a_2(2) & \cdots & a_p(p)   \\
              0 & 1      & a_2(1) & \cdots & a_p(p-1) \\
              0 & 0      & 1      & \cdots & a_p(p-2) \\
              \vdots &   &        & \ddots & \vdots   \\
              0 & 0      & 0      & \cdots & 1
            \end{bmatrix}
          = \begin{bmatrix}
              E_0 &     &        & 0   \\
                  & E_1 &        &     \\
                  &     & \ddots &     \\
              X   &     &        & E_p
            \end{bmatrix}                                (A-2)

where the lower triangular part X is possibly nonzero. It
follows that

    [R C^T]^T = C R^T = C R                              (A-3)

is upper triangular. Now, since both CR and C^T are upper
triangular, their product is upper triangular and the elements
along the main diagonal are the products of the diagonal
elements from both matrices. Therefore,

    C R C^T = \begin{bmatrix}
                E_0 &     &        & X   \\
                    & E_1 &        &     \\
                    &     & \ddots &     \\
                0   &     &        & E_p
              \end{bmatrix}.                             (A-4)

Since [C R C^T]^T = C R C^T, the product matrix is symmetric
and the part denoted by X in (A-4) must be zero. Therefore,

    C R C^T = E.

APPENDIX II

CORRELATION PROPERTIES

Below, we present some properties of the correlations
between the forward and backward residuals in the lattice.

From the definitions of the residuals given earlier, we have

    f_m(n) = \sum_{k=0}^{m} a_m(k) x(n - k)              (B-1)

    g_m(n) = \sum_{k=0}^{m} a_m(k) x(n - m + k)          (B-2a)

    a_m(0) = 1,                                          (B-2b)

as in (7). We shall assume in the sequel that the predictor
coefficients a_m(k) are those that minimize the energy in the
forward and backward residuals, and hence obey (24) and
(25), which we rewrite here as

    \sum_{k=0}^{m} a_m(k) R(i - k) = 0,   1 <= i <= m    (B-3)

    \sum_{k=0}^{m} a_m(k) R(k) = E_m,                    (B-4)

where E_m is the minimum energy at stage m, and R(k) is the
autocorrelation of the signal x(n) defined by

    \overline{x(n - i) x(n - j)} = R(i - j).             (B-5)

Most of the properties given below make simple use of the
equations above; hence, no detailed derivations are given.

A. Properties

The first two properties are restatements of the
orthogonality condition (B-3):

    \overline{f_m(n) x(n - i)} = 0,   1 <= i <= m.       (B-6)

    \overline{g_m(n) x(n - i)} = 0,   0 <= i <= m - 1.   (B-7)

Note the difference between the ranges of i in (B-6) and (B-7).
Using (B-4), we have

    \overline{f_m(n) x(n)} = \overline{g_m(n) x(n - m)} = E_m.   (B-8)

Now, making use of (B-6)-(B-8) also, we have the following
two properties:

    \overline{f_i(n) f_j(n)} = E_{max(i,j)}              (B-9)

    \overline{g_i(n) g_j(n)} = E_i for i = j, and 0 for i != j.   (B-10)

Equation (B-10) is identical to (33). Note from (B-9) that the
forward residuals do not exhibit the same orthogonality
property as the backward residuals. However, there is a certain
duality between the forward and backward residuals, first
exhibited in (B-1) and then in (B-8). We shall see now that
(B-9) and (B-10) also have duals.

From (B-6) and (B-7), one can establish the following
conditions for the orthogonality between delayed versions of the
residuals:

    \overline{f_i(n) f_j(n - r)} = 0   for   1 <= r <= i - j,        i > j
                                             -1 >= r >= i - j,       i < j    (B-11)

    \overline{g_i(n) g_j(n - r)} = 0   for   0 <= r <= i - j - 1,    i > j
                                             0 >= r >= i - j + 1,    i < j    (B-12)

where r is an integer lag. The only value of r in (B-12) that
is the same for i > j and i < j is r = 0. For r = 0, we already
have (B-9) and (B-10). The duals stem from (B-11), where
the only value of r that is the same for i > j and i < j is
r = i - j. Making use of the general relation
\overline{v(n) w(n - r)} = \overline{v(n + r) w(n)}, we have
for r = i - j

    \overline{f_i(n + i) f_j(n + j)} = E_i for i = j, and 0 for i != j   (B-13)

    \overline{g_i(n + i) g_j(n + j)} = E_{max(i,j)}.                     (B-14)

Equations (B-13) and (B-14) are the duals of (B-9) and (B-10).
(B-13) is the orthogonality equation for the forward residuals.

Making use of (B-2) one can derive the following
cross-correlation property:

    \overline{f_i(n) g_j(n)} = K_j E_i for i >= j, and 0 for i < j;   i, j >= 0   (B-15)

where we have assumed that K_0 = 1.

From (15) and (34), we have

    \overline{f_i(n) g_i(n - 1)} = -K_{i+1} E_i.                         (B-16)

From (B-16) one can show that

    \overline{g_i(n - 1) x(n)} = \overline{f_i(n + 1) x(n - i)} = -K_{i+1} E_i.   (B-17)

In general, we have the following cross-correlation property:

    \overline{f_i(n) g_j(n - 1)} = 0 for i > j, and -K_{j+1} E_j for i <= j;   i, j >= 0.   (B-18)

Note the difference between (B-15) and (B-18).

A mixed-source model for speech compression and synthesis

This paper presents an excitation source model for speech compression and synthesis that allows the
degree of voicing to be varied continuously by mixing voiced (pulse) and unvoiced (noise) excitations in a
frequency-selective manner. The mix is achieved by dividing the speech spectrum into two regions, with
the pulse source exciting the low-frequency region and the noise source exciting the high-frequency region.
The degree of voicing is specified by a parameter F_c, which corresponds to the cut-off frequency between
the voiced and unvoiced regions. For speech compression applications, F_c can be extracted automatically
from the speech spectrum and transmitted. Experiments performed with the new model indicate its power
in synthesizing natural sounding voiced fricatives and in largely eliminating the "buzzy" quality of
vocoded speech. A functional definition of buzziness and naturalness is given in terms of the model.

PACS numbers: 43.70.Lw, 43.70.Jt


INTRODUCTION 

Perhaps the single most important decision to be made
in a pitch-excited speech compression system (vocoder)
is the voiced/unvoiced (V/U) decision. Errors made in
this decision will readily be perceived by the ear as a
degradation of speech quality, which may also be
accompanied by a loss in intelligibility. Yet, even if the
V/U decision were somehow to be made "perfectly," the
resynthesized speech would still lack naturalness,
because of various other factors. In this paper, we shall
deal with those aspects of naturalness that can be
influenced by the source excitation, particularly "buzziness"
and "lack of fullness," two qualities often associated
with vocoded speech, especially for speech generated by
a linear prediction (LPC) vocoder.

There are, of course, different types of buzziness
caused by different factors, not all of which are related
to the source excitation. In the work described in this
paper, however, we have investigated the degree to
which buzziness may be reduced by source manipulation,
regardless of the cause of the buzziness.

In this paper, we explore the excitation problem in
speech synthesis and present a simple mixed-source
model that allows for a degree of voicing. The new
model is capable of producing more natural-sounding
speech; it seems to eliminate most of the problem of
buzziness and recover part of the fullness of natural
speech. In addition, the model promises to reduce the
adverse effects of voicing errors. Other mixed-source
models have been used previously in speech compression
and synthesis (Holmes, 1973; Fujimura, 1966, 1968;
Itakura and Saito, 1968; Kato et al., 1967; Coulter,
1975; Klatt, 1976; Strube, 1977). A review of previous
research relating to the model presented in this paper
is given in Sec. VII.

I.  BASIC  SYNTHESIS  MODEL  AND  TERMINOLOGY 

Throughout this paper, we shall assume the basic
synthesis model shown in Fig. 1. In this model, a
time-varying excitation signal excites a time-varying
spectral shaping filter to produce the synthetic speech.
The excitation signal is assumed to have a flat spectrum,
so that the spectral envelope of the synthetic
speech is determined completely by the spectral shaping
filter. Furthermore, we shall assume this model
to hold for any type of synthesis, whether as part of a
vocoder system or a speech synthesis system. In fact,
we will argue that our proposed source model is indeed
adequate for both applications.

Assuming, then, that the excitation has a flat spectrum,
we are necessarily limited to two types of excitation:
deterministic (pulse) or random (noise).

A.  Pulse  source  (buzz) 

The deterministic excitation used is, in general, the
impulse response of an all-pass filter, which we shall
call an all-pass signal or pulse. The simplest form of
an all-pass pulse is a single impulse. When the pulse
source produces a sequence of pulses separated by a
pitch period, it is known as a buzz source. [Note that a
single pulse could be used in the synthesis of the burst
in a plosive sound (Holmes, 1973). However, the burst
can also be synthesized using the noise source. We
shall assume the latter in this paper; the pulse source
will be used exclusively for buzz excitation.]

B.  Noise  source  (hiss) 

The random noise excitation may be the output of a
random number generator. Generators with either a
uniform or a Gaussian probability distribution are readily
available and appear to be quite adequate. The noise
source is also known as a hiss source.

Whether the actual excitation is buzz, hiss, or a
combination of the two, it is important that the excitation
have a flat spectrum. We next describe how one might
derive an appropriate source model by inspecting
short-time speech spectra.


FIG. 1. Basic synthesis model.

FIG. 2. Inverse filtering the speech signal to obtain a residual
signal with a flat spectrum.


II. THE "IDEAL" SOURCE

For some particular speech signal, one can remove
the short-time spectral envelope by appropriate inverse
filtering, as shown in Fig. 2. The inverse filter A(z) can
be obtained by cepstral techniques (Oppenheim and
Schafer, 1975) or through the use of linear prediction
(Makhoul, 1975). The residual signal e(t) will then have
a nominally flat spectrum. If, in Fig. 1, the excitation
u(t) can be made identical to the residual e(t) and the
synthesis filter H(z) is the inverse of A(z), the
reconstructed synthetic speech r(t) will be identical to the
original input signal s(t).

However, for general synthesis purposes, the
synthetic signal need only sound like the original; it need
not be identical to it. In addition, in synthesis we need
to manipulate the source pitch, and in vocoding we need
to minimize the number of bits required to represent
the source. To accomplish these tasks, we make use
of an important property of speech perception: its
relative insensitivity to the short-time phase. Therefore,
to model the residual e(t) to meet our requirements, we
need only look at its spectrum and, except for pitch,
disregard its phase structure for the moment.

Figure 3 shows the signal power spectrum of 25.6 ms
of a 10-kHz sampled signal in the middle of the vowel
[I] in the word "list," and the corresponding residual
spectrum. The residual was obtained by inverse filtering
the speech signal with a 20th-order linear prediction
inverse filter. If one could generate an excitation u(t)
whose spectrum is identical to the residual spectrum,
the synthetic speech might then sound almost the same
as the original.

Therefore, our aim in developing a source model is
to obtain an excitation spectrum that is as close as
possible to the residual spectrum. Furthermore, in
obtaining such a spectrum, we want to use only the buzz
and hiss sources described in Sec. I. The source model
will follow naturally from the characteristics of residual
spectra.

III. CHARACTERISTICS OF RESIDUAL SPECTRA

The residual spectrum in Fig. 3 shows a clear
periodicity up to about 3.5 kHz and a lack of periodicity
above that frequency. The periodicity is shown by the
regularly spaced peaks in the spectrum that correspond
to the harmonics of the voice fundamental frequency.
Inspection of residual spectra of other sounds shows
that the existence of aperiodic frequency bands in
sonorant sounds is quite common. While only two bands can
be identified in Fig. 3, one periodic and one aperiodic,
it is possible to have several adjacent periodic and
aperiodic bands in 5 kHz. For more examples, the
reader is referred to the work of Fujimura (1968).

Partial devoicing of certain sounds is well known from
physical considerations. For example, [z] is devoiced
above about 1 kHz, and several attempts at the synthesis
of more natural voiced fricatives have made use of this
fact. On the other hand, it is also known that in the
production of the tense front vowel [i], the constriction
may become narrow enough to generate some turbulence,
which results in devoicing of frequencies above about
3 kHz. To date, however, most synthesizers have not
taken advantage of this fact.

In addition to the foregoing examples of devoicing,
Fujimura (1968) has hypothesized that devoicing of some
spectral regions may be due to aperiodicities or
irregularities in vocal-cord movement. We have observed
that spectral devoicing often occurs during transitions
between different sounds, including sonorant-sonorant
transitions. In contrast to the examples given in the
previous paragraph, we believe that the spectral
devoicing due to vocal-cord irregularities and/or spectral
transitions may in fact be an artifact of the spectral
estimation process. It is not clear whether it is
appropriate to use a noise source for synthesizing such
devoiced regions.

In conclusion, residual spectra may be completely
periodic (voiced), completely aperiodic (unvoiced), or
may contain some spectral regions that are periodic
and others that are aperiodic. We discuss next how such
spectra can best be modeled using the buzz and hiss
sources.


IV.  PROPOSED  SOURCE  MODEL 

One reasonable source model would divide the spectrum
into a number of bands. Each band considered to
be periodic would be excited by the buzz source, and
each band considered to be aperiodic would be excited
by the hiss source. Fujimura (1968) used a three-band
model in his experiment with a channel vocoder, and
reported an improvement in speech naturalness. However,
given our observations that spectral aperiodicities may
not necessarily result from turbulent excitations, we
have chosen a simpler model. We treat as periodic all
spectral aperiodic regions that are in between two
periodic regions. In other words, only the band above
the highest frequency region that is periodic will be
treated as aperiodic and will be excited by a turbulent
source. There are two reasons for this choice: (a) turbulent
sources are more likely to excite high frequencies
than low, and (b) excessive devoicing can degrade quality
just as severely as excessive voicing.

FIG. 3. Signal spectrum (top) and residual spectrum (bottom)
for the vowel [I] in the word "list."

FIG. 4. Frequency-selective mixed-source excitation model.

The resulting model is shown in Fig. 4. It is a
mixed-source model where the buzz source excites a
time-varying low-frequency region of the spectrum, and the
hiss source excites the remaining high-frequency region.
The excitation signal is realized by passing the
pulse excitation through a low-pass filter with cutoff F_c,
and the noise excitation through a high-pass filter with
the same cutoff frequency F_c, then adding the outputs of
the two filters. The resulting mixed-source excitation
signal is multiplied by the source gain and applied to
the spectral shaping filter. The model, then, has only
two parameters: the cutoff frequency F_c, and the pitch
period t when F_c > 0. Since small changes in F_c do not
seem to be perceptible, it is sufficient to quantize F_c
into 2-3 bits for transmission purposes.
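The structure of Fig. 4 can be sketched in a few lines. The code below is a minimal illustration rather than the implementation used in this work: it assumes ideal (brick-wall) low-pass and high-pass filters applied in the frequency domain, a pulse amplitude of sqrt(pitch period) so that the pulse and noise sources have comparable power, and names (mixed_excitation, pitch_period, fc, seed) chosen only for the example. The source gain and the spectral shaping filter are omitted.

```python
import numpy as np

def mixed_excitation(n_samples, fs, pitch_period, fc, seed=0):
    """Low-passed pulse train plus high-passed noise, both cut at fc (Hz).

    pitch_period is in samples. Brick-wall filtering in the frequency
    domain is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)          # hiss source, unit variance

    pulses = np.zeros(n_samples)                    # buzz source
    if fc > 0:
        pulses[::pitch_period] = np.sqrt(pitch_period)

    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    low = freqs < fc                                # band excited by the buzz source
    spectrum = np.fft.rfft(pulses) * low + np.fft.rfft(noise) * ~low
    return np.fft.irfft(spectrum, n=n_samples)

# Example: 10-kHz sampling, 64-sample pitch period, cutoff at 2 kHz.
u = mixed_excitation(256, 10000.0, 64, 2000.0)
```

With fc = 0 the output reduces to the noise source alone, and with fc above half the sampling rate it reduces to the pulse train, which correspond to the two extremes of the conventional binary V/U model.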

V.  IMPLEMENTATION 
A.  Extraction  of  source  parameters 

The only novelty in parameter extraction for the new
source model is that the traditional binary V/U decision
has been replaced by the determination of a multivalued
parameter F_c. The extraction of the pitch period is not
affected, and we shall not address that problem here.

The question, of course, is how to compute F_c in a
perceptually satisfactory manner. One of the algorithms
we had considered but rejected computed F_c from the
autocorrelation of the residual. We had hoped that the
normalized autocorrelation of the residual, at a lag
equal to the pitch period, could be calibrated to indicate
the degree of voicing. However, it became clear that
the value of the autocorrelation at the pitch period was
complicated by the interaction between the data window
and the pitch period. Therefore, we abandoned this line
of attack in favor of the method described below.


The method we have chosen thus far is a peak-picking
algorithm on the signal spectrum. The algorithm
determines periodic regions of the spectrum by examining
the separation between consecutive peaks and determining
whether the separations are the same, within some
tolerance level. F_c is taken to be the highest frequency
at which the spectrum is considered to be periodic.

For each speech frame, we obtain a pitch frequency
estimate using a modified version of a recently developed
harmonic pitch extractor (Seneff, 1977), which
also employs peak picking of the signal spectrum and
therefore fits very naturally into our scheme. Using
this pitch frequency estimate, the algorithm tests
whether the separations between adjacent peaks are within
some tolerance from the pitch estimate. F_c is then
that frequency beyond which the separations between
peaks do not fall within the given tolerance levels. The
algorithm includes heuristics that take into account
occasional regions of aperiodicity as well as shifts in
pitch frequency from one spectral region to another.
(The latter phenomenon might be attributable to the
changing pitch frequency within one analysis frame.)
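The core of the peak-spacing test can be sketched as follows. This is a deliberately simplified illustration: it assumes the spectral peaks have already been picked, uses a relative tolerance whose value (15%) is arbitrary, stops at the first spacing that fails the test, and leaves out the heuristics for occasional aperiodic regions and pitch shifts mentioned above. The function and parameter names are hypothetical.

```python
def estimate_cutoff(peak_freqs, f0, rel_tol=0.15):
    """Highest peak frequency (Hz) reached while peak spacings match f0.

    peak_freqs: frequencies of picked spectral peaks (Hz).
    f0: pitch frequency estimate (Hz).
    Returns 0.0 when even the first spacing fails the test.
    """
    peaks = sorted(peak_freqs)
    fc = 0.0
    for lower, upper in zip(peaks, peaks[1:]):
        if abs((upper - lower) - f0) > rel_tol * f0:
            break                    # spacing no longer harmonic: stop here
        fc = upper
    return fc

# Harmonic spacing holds up to 510 Hz, then breaks (510 -> 690 is 180 Hz).
print(estimate_cutoff([100, 200, 300, 400, 510, 690, 800], f0=100.0))
```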

After extracting F_c for consecutive frames, a three-
point median smoother is applied to the computed cutoff
frequency values. This smoothing technique corrects
the cutoff frequency upward whenever it is lower than
two adjacent values and, therefore, eliminates spurious
low cutoff frequency values.
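A three-point median smoother of the kind described can be written directly. In this sketch the first and last frames are left unchanged, which is an assumption on our part since the treatment of the end points is not specified; the function name is also ours.

```python
def median_smooth3(values):
    """Three-point running median; end points are passed through unchanged."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = sorted(values[i - 1:i + 2])[1]
    return out

# The isolated low value of 500 Hz is raised to the median of its neighborhood.
cutoffs = [3000, 500, 3500, 3500, 4000]
print(median_smooth3(cutoffs))   # [3000, 3000, 3500, 3500, 4000]
```

Note that the same operation also lowers an isolated value that is higher than both of its neighbors.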

B.  Filter  implementations 

In our initial implementation, we rounded the value of
F_c to the nearest 500 Hz. Therefore, we needed lowpass
and highpass filters with cutoff frequencies separated
by 500 Hz. The filter designs were then stored and
used in the synthesis as the need arose.

For each value of F_c, the 3-dB points for the lowpass
and highpass filters were designed to be equal to F_c, so
that the spectrum of the final excitation may be as flat
as possible. The rolloff of the filters was considered to
be of secondary importance; filters with transition
regions of about 1000 Hz were felt to be suitable. We
considered FIR (finite impulse response) as well as recursive
(low order Butterworth) filters. The two types of
filters gave similar perceptual results.
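The reason for placing both 3-dB points at F_c can be illustrated with the ideal Butterworth magnitude responses. For a lowpass and a highpass Butterworth filter of the same order and cutoff, the squared magnitudes sum to one at every frequency, so two uncorrelated sources of equal spectral level combine into a flat spectrum. The sketch below uses the analog prototype responses and hypothetical function names; it is not the stored filter set described above, and the tie-breaking rule in the rounding (halves rounded up) is our assumption.

```python
import math

def round_cutoff(fc, step=500.0):
    """Round fc to the nearest multiple of step (halves rounded up)."""
    return step * math.floor(fc / step + 0.5)

def butter_lowpass_power(f, fc, order):
    """Squared magnitude of an analog Butterworth lowpass at frequency f."""
    return 1.0 / (1.0 + (f / fc) ** (2 * order))

def butter_highpass_power(f, fc, order):
    """Squared magnitude of an analog Butterworth highpass at frequency f."""
    x = (f / fc) ** (2 * order)
    return x / (1.0 + x)

fc = round_cutoff(1740.0)                 # 1500.0
for f in (250.0, 1500.0, 4000.0):
    total = butter_lowpass_power(f, fc, 3) + butter_highpass_power(f, fc, 3)
    print(f, total)                       # the two powers sum to 1 (up to rounding)
```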

VI.  RESULTS 

Using the implementation described in Sec. V, we
compared the resulting synthesized speech to speech
that employed the binary V/U model in the context of
a linear prediction (LPC) vocoder. A number of
sentences from male and female speakers (Huggins,
Viswanathan, and Makhoul, 1977) were used in comparing
the two analysis-synthesis systems. No quantization
of parameters (except for F_c) was performed. One of
the sentences had a concentration of fricative sounds:
"His vicious father has seizures"; and another was a
nonnasal sonorant sentence: "Why were you away a
year, Roy?" Other sentences were more general. With
the V/U source, the fricative sentence sounded
particularly buzzy for both male and female speakers, while
the sonorant sentence was judged as buzzy only for
low-pitched male speakers. The buzziness in both sentences
was greatly reduced when using the mixed-source model.
In general, the buzziness was always reduced with
the new model. However, for some sentences the new
synthesis produced certain small background noises.
Upon careful listening, it was determined that some of
these noises were also present in the V/U synthesis,
but were masked by the buzziness. The other noises
may be due to inaccurate determination of F_c and/or to
the particular implementation of the model.

Overall, listeners thought that the new model
performed better for female speakers. The new synthesis
was "raspier" and more characteristic of female speech,
which is considered to be more breathy than male speech.
A number of listeners reported that the new synthesis
had a certain "fullness" that was absent with the V/U
synthesis. We interpret this as an indication of the
greater naturalness resulting from the new model.

Formal  testing  of  our  implementation  of  the  mixed- 
source  model  is  planned.  The  results  will  be  reported 
in  a  subsequent  paper. 

VII.  REVIEW  OF  RELATED  WORK 

We know of only one other work in which mixed
excitation was used with LPC vocoders: that of Itakura and
Saito (1968). There, however, the two sources excited
the whole spectrum simultaneously, with the "degree"
of voicing being controlled by the relative amplitudes of
the sources. The results were not encouraging (Itakura).

After the development of our model over two years
ago, we became aware of Fujimura's work (1966, 1968).
As far as we know, he was the first to suggest and test
a frequency-selective mixed-source model. His work,
which we mentioned earlier, was performed in the
context of a pitch-excited channel vocoder. Fujimura
brought to our attention his other work with Kato et al.
(1967), which employed a variable cutoff frequency like
ours, but used a different algorithm to determine the
cutoff. The work was done with a hybrid voice-excited
and pitch-excited channel vocoder, and the researchers
reported excellent results. Coulter (1975) used mixed
excitation for the synthesis of voiced fricatives; however,
the cutoff between the low- and high-frequency bands
was fixed.

In speech synthesis, mixed excitation has been used
routinely for the synthesis of voiced obstruents (see,
for example, Holmes, 1973; Klatt, 1976). The parallel
formant synthesizer of Holmes (1973) allows for variable
mixed excitation, and was especially used in transitions
between unvoiced and voiced sounds. Upon careful
reading, it became clear to us that the spirit of
Holmes’ synthesizer is similar to ours, except that, in
his case, the controls are more complicated. A more
recent hardware synthesizer by Strube (1977) allows for
mixed excitation using a single variable RC circuit.

There have been numerous attempts at reducing buzziness
by changing the shape of the pulse in voiced excitation,
and results have been mixed. Recently, Sambur
et al. (1977) reported a reduction in vowel buzziness
(exhibited mainly with low-pitch voices) by changing the
pulse width to be proportional to the pitch period.
Unfortunately, changing the pulse width changes the excitation
spectrum; the effect is that of a variable lowpass
filter. Spectrally flattening the pulse before excitation
seemed to cancel the reduction in buzziness (Atal, 1977).

VIII.  DISCUSSION 

A.  Buzziness  and  naturalness 

It is interesting that the mixed-source model appears
to reduce two seemingly different types of buzziness:
the buzziness in voiced fricative synthesis and the buzziness
in sonorant synthesis associated mainly with low-pitched
voices. Our hypothesis is that the two types of
buzziness may, in fact, result from the same process:
that of an excess in buzz source excitation. Thus, our
general rule is that:

Too  much  buzz  results  in  “buzziness.” 

Too  much  hiss  results  in  “breathiness”  or  “raspiness.” 

If more of the spectrum is excited by the buzz source
than is necessary for naturalness, the result is buzziness.
Similarly, if there is more hiss excitation than
is necessary for naturalness, the result is breathiness
or raspiness. These observations lead us to a functional
definition of one aspect of naturalness, as it relates to
mixed excitation:

Naturalness is maximized by that proper mix of
buzz and hiss excitations that leads to a synthesis that
is neither buzzy nor breathy (or raspy).

We  must  emphasize  here  that  the  mixed-source  model 
may  not  remove  all  types  of  buzziness.  In  fact,  it  is 
conceivable  that  for  certain  types  of  buzziness,  such  as 
in  sonorant  sounds,  the  mixed  source  may  be  simply 
masking  the  buzziness  by  substituting  hiss  for  buzz  at 
high frequencies. That such a solution seems to succeed
in eliminating the buzziness does not necessarily
mean  that  other  solutions  may  not  be  more  appropriate. 

B.  Modulation  and  naturalness

Certain synthesizers, such as that of Klatt (1976),
amplitude-modulate the hiss source by the buzz source
for the synthesis of voiced fricatives. While it is known
that the noise source in the vocal tract is in fact modulated
by the vocal cord output, it is not clear that such
modulation  is  necessary  for  achieving  naturalness  in 
synthetic  speech.  Whatever  effect  modulation  has,  it 
appears  to  be  of  a  secondary  nature.  Holmes  (1973) 
reported  very  natural  speech  for  his  synthesizer,  which 
did  not  contain  any  modulation.  Although  we  initially 
included  modulation  in  our  model,  it  is  at  present  our 
opinion  that  source  modulation  may  not  be  necessary  for 
natural  synthesis,  and  therefore  we  have  decided  not  to 
incorporate  it  as  part  of  the  model. 

C.  Phase  and  naturalness 

It is generally agreed that proper phase determination
of buzz excitation should lead to more natural
synthesis.  Furthermore,  such  phase  cannot  be  in  the 
form  of  some  “optimal”  pitch  pulse  shape.  The  phase 
must  change  from  one  pitch  pulse  to  the  next  in  some 
appropriate  manner.  Thus  far,  our  model  calls  for  an 
all-pass  pulse,  but  does  not  specify  the  phase.  Exactly 
how  the  phase  should  change  between  pulses  is  a  subject 
for future research.

IX.  CONCLUSION 

We have presented a frequency-selective mixed-source
excitation model for use in both speech compression
and speech synthesis. The model has a single continuous
parameter, F_c, which effectively divides the
spectrum into two regions. A buzz source excites the
low-frequency region below F_c, and the hiss source
excites the high-frequency region above F_c. Naturalness
(no buzziness or breathiness) is maximized by the proper
mix of the two sources, i.e., by the proper determination
of F_c. For sentences where F_c was properly
extracted, the speech sounded much more natural.
Improper determination of F_c may lead to either buzziness
or breathiness.
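
As an illustration only, the two-region excitation can be sketched in a few lines of Python. This is not the implementation described in the paper: the frame-wise DFT construction, the function name mixed_excitation, the unit impulse train used as the buzz source, and the Gaussian noise used as the hiss source are all assumptions made for the sketch, and the relative gain of the two sources is left unspecified.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT; returns the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def mixed_excitation(pitch_period, fc, fs, n, seed=0):
    """Build one frame whose spectrum is buzz below fc and hiss above fc.

    pitch_period is in samples; fc and fs are in Hz. All names are
    hypothetical and chosen for this sketch only.
    """
    rng = random.Random(seed)
    buzz = [1.0 if t % pitch_period == 0 else 0.0 for t in range(n)]
    hiss = [rng.gauss(0.0, 1.0) for _ in range(n)]
    buzz_spec, hiss_spec = dft(buzz), dft(hiss)
    mixed = []
    for k in range(n):
        # Frequency of bin k, folded into 0..fs/2 so that the selection is
        # symmetric and the resulting time signal is real.
        f = min(k, n - k) * fs / n
        mixed.append(buzz_spec[k] if f < fc else hiss_spec[k])
    return idft(mixed)


frame = mixed_excitation(pitch_period=32, fc=2000.0, fs=8000.0, n=128)
```

Raising fc in this sketch hands more of the spectrum to the buzz source, and lowering it hands more to the hiss source, which is the single control the model provides.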

While other aspects of the synthesis process may be
more important in producing greater naturalness, we
believe that any model that attempts to achieve naturalness
must include in it some type of mixed excitation.
The  mixed-source  model  presented  in  this  paper  is  a 
step  in  that  direction. 

ABSTRACT

The  traditional  method  of  high-frequency 
regeneration  (HFR)  of  the  excitation  signal  in 
baseband  coders  has  been  to  rectify  the  transmitted 
baseband,  followed  by  spectral  flattening.  In 
addition, a noise source is added at high
frequencies to compensate for lack of energy during
certain sounds. In this paper, we reexamine the
whole HFR process. We show that the degree of
rectification  does  not  affect  the  output  speech, 
and  that,  with  proper  processing,  the 
high-frequency  noise  source  may  be  eliminated.  We 
introduce  a  new  type  of  HFR  based  on  spectral 
duplication  of  the  baseband.  Two  types  of  spectral 
duplication  are  presented:  spectral  folding  and 
spectral  translation.  Finally,  in  order  to 
eliminate  the  problem  of  breaking  the  harmonic 
structure  due  to  spectral  duplication,  we  propose  a 
pitch-adaptive  spectral  duplication  scheme  in  the 
frequency  domain  by  using  adaptive  transform  coding 
to  code  the  baseband. 


1.  INTRODUCTION 

Baseband  coders,  or  what  are  known  also  as 
voice-excited  coders  [1-3],  were  originally 
proposed  as  a  compromise  between  pitch-excited 
coders  (such  as  LPC,  channel  and  homomorphic 
vocoders)  and  waveform  coders.  Today,  baseband 
coders  offer  attractive  alternatives  at  data  rates 
in the range 6.4-9.6 kb/s. This range of data
rates has become increasingly important because
modems  are  now  available  that  operate  reliably  in 
that  range  over  regular  telephone  lines. 

Below,  we  shall  assume  that,  at  the  receiver, 
the  synthesizer  obeys  the  general  synthesis  model 
shown  in  Fig.  1.  In  the  figure,  the  synthetic  or 
reconstructed  speech  signal  r(t)  is  generated  as 
the  result  of  applying  a  time-varying  excitation 
signal  u(t)  as  input  to  a  time-varying  spectral 
shaping  filter.  The  spectral  envelope  of  the 
excitation is assumed to be flat, so that the
spectral  envelope  of  the  synthetic  speech  is 
determined  completely  by  the  spectral  shaping 
filter.  The  parameters  of  the  model,  i.e.,  the 
excitation  and  the  filter,  must  be  computed  and 
transmitted  periodically  by  the  transmitter.  Those 
parameters  that  represent  the  speech  spectrum, 
denoted  as  spectral  parameters,  are  computed, 
quantized  and  transmitted  every  15-30  ms.  In  a 
baseband  coder,  a  low-frequency  portion  of  the 
excitation, known as the baseband, is transmitted and
used at the receiver to regenerate the
high-frequency portion of the excitation. The sum
of the transmitted baseband and the regenerated
high-frequency band constitutes the excitation u(t)
to  the  synthesizer. 


In  a  baseband  coder,  synthetic  speech  quality 
is determined by four factors: a) width of the
baseband,  b)  coding  of  the  baseband,  c)  estimation 
and  coding  of  spectral  parameters,  and  d)  the 
high-frequency  regeneration  (HFR)  method  employed. 
In  this  paper  we  shall  concentrate  mainly  on  the 
fourth  factor,  HFR. 

2.  BASEBAND  CODERS 

Fig.  2  shows  the  transmitter  portion  of  a 
digital  baseband  coder,  based  on  a  linear 
prediction  representation.  In  this  and  other 
figures,  we  indicate  the  Nyquist  frequency  (half 
the  sampling  frequency)  in  parentheses  at  each 
point in the process. The speech signal s(t),
sampled at 2W Hz, is first filtered with the LPC
inverse filter A(z) to produce the residual e(t).
(The  parameters  of  A(z)  are  transmitted 
separately.)  Baseband  extraction  is  usually  in  the 
form  of  a  lowpass  or  bandpass  filter  of  width  B  Hz. 
The  signal  b(t)  is  known  as  the  baseband  residual, 
which  is  decimated  to  a  signal  v(t)  sampled  at 
2B  Hz.  v(t)  is  then  coded  and  transmitted. 

The  coding  may  be  quite  simple,  using  adaptive 
quantization,  or  may  be  more  complicated,  employing 
adaptive predictive coding (APC), adaptive
transform coding (ATC), or sub-band coding (SBC).
Only APC has been used previously in coding the
baseband residual [8], while SBC has been used in
coding a baseband of the speech signal itself [7].
In  Section  6  we  propose  the  use  of  ATC  in  baseband 
residual  coding. 

Fig.  3  shows  a  block  diagram  of  a  typical 
receiver  for  a  baseband  coder.  The  difference 
between  the  interpolated  signal  b'(t)  and  the 
baseband  residual  b(t)  should  be  primarily  due  to 
quantization  and  coding.  The  signal  b'(t)  goes 
through  a  HFR  process  and  a  highpass  filter  that 
keeps  only  the  high  frequencies  not  contained  in 
b'(t). The sum of the resulting signal and b'(t)
forms u(t), the excitation to the all-pole LPC
synthesizer,  which  produces  the  output  speech. 

3.  HIGH-FREQUENCY  REGENERATION 

It  is  well-known  that  if  the  baseband  has 
either  the  voice  fundamental  or  at  least  two 
adjacent  harmonics,  a  waveform  containing  all  the 
harmonics  of  voiced  input  speech  can  be  generated 
by  feeding  the  baseband  signal  to  an  instantaneous, 
zero-memory,  nonlinear  device.  The  spectral  shape 
of the regenerated harmonic structure can be quite
arbitrary  and  must  be  flattened  to  produce  a 
suitable  excitation  function. 

A digital implementation of a general HFR
system is shown in Fig. 4. High frequencies are
generated by performing some form of nonlinear
distortion on the baseband signal. Because such
distortion may introduce energy at frequencies
higher than W, it is recommended that the baseband
be interpolated to at least double the original
sampling rate before the distortion, in order to
avoid spectral aliasing [7], which can cause
roughness in the output speech. The distorted
signal is then spectrally flattened before it is
used as excitation to the synthesizer. In most
systems to date, a noise source is added to the
distorted signal to compensate for the loss of high
frequencies in fricatives. However, we have found
that if the spectral flattening is performed in
some optimal way (using LPC, for example), the
noise source is unnecessary.

Rectification 

Most  nonlinear  distortion  schemes  to  date  use 
some  form  of  waveform  rectification.  In  general,  a 
rectifier  operating  on  a  signal  x(t)  has  the 
following  input-output  characteristic 

    y(t) = [(1+a)|x(t)| + (1-a)x(t)]/2    (1)

where |·| denotes absolute value, and a is some
constant in the range 0 <= a <= 1. a=0 corresponds to
half-wave rectification and a=1 corresponds to
full-wave rectification. Both of these values have
been used in the past, as well as a value of a=0.5
[6]. We now show that the value of a should have
no effect on the output.

We assume the signal x(t) to be bandlimited to
B Hz; hence, x(t) has no energy above B Hz. The
full-wave rectified signal |x(t)| will have energy
at frequencies above B Hz. It is clear from (1),
then, that the high-frequency energy in y(t) above
B Hz is completely determined by that in |x(t)|.
Except for a scaling factor, the spectra of y(t)
and |x(t)| are identical above B Hz. The spectral
shape of y(t) below B Hz depends on the value of a.
However, we note from Fig. 3 that only the region
above B Hz is taken from the output of HFR.
Therefore, we conclude that a does not affect the
shape of the spectrum of the excitation u(t) and
consequently  does  not  affect  the  output  speech. 
This  conclusion  has  been  borne  out  in  experiments. 
The  user  is  then  free  to  choose  the  type  of 
rectification  based  on  ease  of  implementation, 
without  the  choice  having  an  effect  on  the  output 
speech. 
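
The rectifier characteristic in (1) can be written directly as a short Python function. This is a minimal sketch for illustration; the function name rectify and the sample values are assumptions. It also makes the argument above concrete: the output is a weighted sum of |x(t)| and x(t), and only the |x(t)| term carries energy above B Hz.

```python
def rectify(x, a):
    """Apply y = [(1+a)|x| + (1-a)x] / 2 to each sample, with 0 <= a <= 1.

    a = 0 gives half-wave rectification and a = 1 gives full-wave
    rectification, as stated for equation (1).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in the range 0..1")
    return [((1.0 + a) * abs(v) + (1.0 - a) * v) / 2.0 for v in x]


samples = [-2.0, 3.0, -0.5, 1.0]
half_wave = rectify(samples, 0.0)
full_wave = rectify(samples, 1.0)

# For any a, y - (1-a)/2 * x is a scaled copy of |x|, so the part of y that
# lies above the band of x differs between values of a only by a scale factor.
a = 0.5
residue = [y - (1.0 - a) / 2.0 * v for y, v in zip(rectify(samples, a), samples)]
scaled_abs = [(1.0 + a) / 2.0 * abs(v) for v in samples]
```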

Spectral Flattening

Schroeder et al. [1-3] used waveform clipping
to perform spectral flattening. More recently, Un
and Magill [5] used double differencing to
emphasize high frequencies, while Weinstein [6]
used LPC spectral flattening. On the other hand,
Esteban et al. [7] did not report any spectral
flattening in their system.

We have used adaptive LPC spectral flattening
in conjunction with full-wave rectification to
implement a 9.6 kb/s baseband coder. Although the
output  speech  was  of  high  quality,  a  small  amount 
of  roughness  could  be  perceived  upon  listening  with 
a  headset. 

4.  HFR  BY  SPECTRAL  DUPLICATION

Here  we  present  a  new  HFR  method  based  on 
duplication  of  the  baseband  spectrum.  The  idea 
behind  the  new  method  derives  from  the 
pitch-excited  coder.  In  voiced  excitation,  the 
spectrum of the excitation is a flat line spectrum
at multiples of the fundamental pitch frequency.
Such a spectral structure is periodic and
repetitive: the high-frequency structure is the
same as at low frequencies. The spectrum of
unvoiced excitation, on the other hand, is
continuous and has a random spectrum with a flat
envelope.  However,  the  details  of  the  unvoiced 
spectrum  are  not  as  perceptually  important  as  the 
details  of  the  voiced  spectrum.  Therefore,  the 
unvoiced  spectrum  can  be  considered  repetitive 
also,  in  that  any  similar  spectrum  can  be 
substituted  with  equally  good  results. 

The new regeneration method, then, is simply
to duplicate the baseband spectrum at higher
frequencies, in some fashion. We shall show
systems that perform spectral duplication in each
of two ways: a) spectral folding, and b) spectral
translation. Below, we shall assume that the
signal bandwidth W is an integer multiple of the
baseband width B, and we shall denote W/B=L. This
integer-band assumption greatly simplifies the two
implementations.

Fig. 5 shows the desired results for L=3.
Fig. 5a shows the baseband residual spectrum; Fig.
5b shows the result of spectral folding, and Fig.
5c the result of spectral translation. In Fig. 5b,
the spectrum in the second band (between B and 2B)
is the mirror image (folded version) of the
baseband, while the spectrum in the third band is a
folded version of the spectrum in the second band,
which makes the spectrum in the third band
identical in shape to the baseband spectrum. In
Fig. 5c, the second and third bands have spectra
identical to the baseband. One can think of the
higher bands being obtained as a result of
translating the baseband.

Spectral  Folding 

Fig.  6  shows  the  complete  receiver  using 
integer-band  spectral  folding.  To  perform  an 
L-band  spectral  folding,  one  simply  inserts  L-1 
zeros  between  samples  of  the  transmitted  baseband. 
This  process  is  merely  that  of  upsampling,  which  is 
well  known  to  produce  spectral  folding 
automatically.  Therefore,  essentially  no 
computations  are  needed  to  generate  the  excitation. 
Because  the  baseband  residual  has  a  flat  spectrum, 
the  excitation  will  have  a  flat  spectrum,  and  there 
is  no  need  for  spectral  flattening. 
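
The zero-insertion step can be sketched as follows. The function names and the test signal are assumptions made for illustration. The check at the end uses a naive DFT to confirm the standard property that the spectrum of the zero-inserted signal is the baseband spectrum repeated L times around the unit circle; for a real signal, that repetition is what places a mirror image of the baseband in the band between B and 2B.

```python
import cmath


def upsample_fold(v, L):
    """Insert L-1 zeros after every sample of the baseband signal v."""
    out = []
    for s in v:
        out.append(s)
        out.extend([0.0] * (L - 1))
    return out


def dft(x):
    """Naive discrete Fourier transform (adequate for short test signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


v = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0, 0.5, 0.75]   # arbitrary test values
V = dft(v)
U = dft(upsample_fold(v, 3))

# Bin k of the long transform equals bin (k mod N) of the baseband transform.
max_err = max(abs(U[k] - V[k % len(v)]) for k in range(len(U)))
```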

Spectral Translation

Fig. 7 shows the general system for
integer-band spectral translation. Note that
multiplication by (-1)^n gives a signal with the
mirror image spectrum. Again in Fig. 7 we use
upsampling to produce spectral folding. The filter
H(z) is a multiple bandpass filter, as shown in
Fig. 8 for L=3, which passes those bands that have
the same shape as the baseband, i.e., every other
band. The filter 1-H(z) then passes the
intervening bands. The sum of the outputs of the
two filters constitutes the excitation for the
synthesizer. The computations implied in Fig. 7
may be reduced substantially by making use of
appropriate symmetries.
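
The mirror-image property of multiplication by (-1)^n can be checked numerically. The sketch below covers only that one property and does not reproduce the full system of Fig. 7; the function names and test values are assumptions. Multiplying by (-1)^n shifts the DFT by half the number of bins, which for a real signal exchanges the low and high ends of the band from 0 to the Nyquist frequency.

```python
import cmath


def flip_spectrum(x):
    """Multiply sample n by (-1)^n."""
    return [s if n % 2 == 0 else -s for n, s in enumerate(x)]


def dft(x):
    """Naive discrete Fourier transform (adequate for short test signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


x = [0.5, 1.0, -0.25, 0.75, 2.0, -1.5, 0.0, 0.25]   # arbitrary test values
n = len(x)
X = dft(x)
X_flipped = dft(flip_spectrum(x))

# Bin k of the flipped signal equals bin (k + n/2) mod n of the original.
shift_err = max(abs(X_flipped[k] - X[(k + n // 2) % n]) for k in range(n))
```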

5.  INITIAL  RESULTS 

In preliminary experiments using HFR by
spectral folding we heard a number of distortions
in the form of added tones. These tones were
generally more audible with a larger number of
bands (i.e., larger L) and for higher-pitched
voices. We were able to verify the existence of a
tone at even multiples of the folding frequency,
i.e., at multiples of 2B Hz. The tone was largely
eliminated by a simple method: we subtracted off
the short-term d.c. in the baseband signal, because
the d.c. is folded into multiples of 2B Hz. After
spectral folding, the original d.c. was restored so
as not to disturb the average signal level, but the
energy at multiples of 2B Hz had already been
eliminated.
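
A minimal sketch of this d.c. handling is given below. The text does not say exactly how the d.c. was restored, so the choice made here (adding the frame mean divided by L to every output sample, which leaves the zero-frequency component the same as in plain folding) is an assumption, as are the function names and test values.

```python
import cmath


def fold_without_dc_tone(v, L):
    """Fold one frame after removing its mean, then restore the d.c. level."""
    mean = sum(v) / len(v)
    folded = []
    for s in v:
        folded.append(s - mean)          # subtract the short-term d.c.
        folded.extend([0.0] * (L - 1))   # zero insertion (spectral folding)
    # Assumed restoration step: a constant offset affects only the
    # zero-frequency bin, so the multiples of 2B Hz stay free of the tone.
    return [s + mean / L for s in folded]


def dft(x):
    """Naive discrete Fourier transform (adequate for short test signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


v = [1.0, 2.0, 0.5, 3.0]                 # arbitrary frame with nonzero mean
U = dft(fold_without_dc_tone(v, 3))
# With N = 4 baseband samples, bins 4 and 8 correspond to 2B Hz and 4B Hz.
tone_bins = [abs(U[4]), abs(U[8])]
```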

Other  audible  tones  have  been  more  difficult 
to  trace.  However,  we  are  currently  working  on 
this  problem.  We  are  also  experimenting  with  the 
alternative  spectral  translation  method. 

Although  spectral  folding  seems  to  generate 
certain  background  low-level  tones,  it  does  not 
seem  to  have  any  perceptible  roughness,  as  was  the 
case  in  rectification.  We  see  this  difference  as  a 
tradeoff  between  the  two  methods  at  this  time. 

One  reason  for  the  existence  of  these 
background  tones  may  be  the  fact  that,  with 
spectral  duplication,  the  harmonic  structure  is 
interrupted at multiples of B Hz. Therefore, one
could  hypothesize  that  the  tones  may  be  eliminated 
by  adjusting  the  width  B  of  the  baseband  to  be  a 
multiple  of  the  pitch  fundamental  frequency  on  a 
short-term  basis.  Unfortunately,  such  a  scheme 
would  require  an  enormous  amount  of  computation, 
which  would  offset  the  initial  reduction  in 
computation  afforded  by  the  spectral  duplication 
method.  However,  if  somehow  one  could  perform  the 
baseband  coding  in  the  frequency  domain  instead  of 
the  time  domain,  pitch-adaptive  spectral 
duplication  could  be  accomplished  very  easily. 
This  is  the  basis  for  our  proposed  system  in  the 
next  section. 

6.  BASEBAND  ADAPTIVE  TRANSFORM  CODER

The  idea  here  is  to  use  ATC  to  code  the 
baseband  residual.  In  ATC,  the  time  signal  is 
transformed  to  another  domain,  quantized  and  coded 
in  that  domain.  If  the  discrete  cosine  transform 
is used, then the coding is in the frequency
domain.

In  addition  to  the  usual  analysis  at  the 
transmitter,  one  also  extracts  the  pitch  for  each 
frame.  At  the  receiver,  the  baseband  frequency 
components  are  duplicated  at  higher  frequencies. 
The  folding  or  translation  frequency  can  be  easily 
changed  each  frame  to  be  a  multiple  of  the  pitch 
fundamental frequency. Note that the fact that the
signal bandwidth will not be an integer multiple of
the now pitch-adaptive baseband poses no problems
at all, which makes this method very general.
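
The frame-by-frame choice of the duplication width can be sketched as follows. This is an illustration under stated assumptions rather than the proposed coder: the rule of taking the largest multiple of the pitch fundamental that fits within a nominal width, the function names, and the representation of the baseband as a plain list of frequency components are all hypothetical.

```python
def adaptive_baseband_width(f0, nominal_width):
    """Return the largest multiple of f0 (Hz) not exceeding nominal_width (Hz)."""
    multiples = int(nominal_width // f0)
    if multiples < 1:
        raise ValueError("nominal width is smaller than the pitch fundamental")
    return multiples * f0


def translate_components(baseband, total):
    """Fill `total` frequency components by repeating the baseband components.

    The last copy is simply truncated, so the full band need not be an
    integer multiple of the baseband.
    """
    return [baseband[k % len(baseband)] for k in range(total)]


width = adaptive_baseband_width(120.0, 1000.0)
excitation = translate_components([1, 2, 3], 8)
```

Because the repetition is done on stored frequency components, changing the width from one frame to the next costs only a change of index arithmetic.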

7.  CONCLUSIONS 

We discussed the problem of high-frequency
regeneration (HFR) in baseband coders. We showed
that all forms of rectification are equivalent in
their performance as HFR agents. When adaptive LPC
is used to perform spectral flattening on the
rectified baseband, it was found that the addition
of extra noise at high frequencies was not
necessary. However, a small amount of roughness
was perceived at 9.6 kb/s.

Two forms of spectral duplication, spectral
folding and spectral translation, were suggested as
new HFR methods. Initial results showed the
existence of background tones, some of which have
been successfully eliminated. As a way to remedy
the break in the harmonic structure produced by
spectral duplication, we proposed a pitch-adaptive
baseband coding system using adaptive transform
coding with spectral duplication in the frequency
domain.

Fig. 1 Basic synthesis model for baseband coder.
Fig. 2 Transmitter for a baseband coder.
Fig. 3 Typical receiver for a baseband coder.
Fig. 4 High-frequency regeneration by nonlinear distortion and spectral
flattening. The output signal in this figure is the output of
the HFR block in Fig. 3.

ABSTRACT

We report on the synthesis of speech in the
context of a phonetic vocoder operating at 100 b/s.
With each phoneme, the vocoder transmits the
duration and a single pitch value. The synthesizer
uses a large inventory of diphone "models" to
synthesize a desired phoneme string. The diphone
inventory has been selected to differentiate
between prevocalic and postvocalic allophones of
sonorants, to account for changes in vowel color
conditioned by postvocalic liquids, to allow exact
specification of voice onset time, and to permit
synthesis of glottal stops, alveolar flaps and
syllabic consonants. The diphones are extracted
from carefully constructed short utterances and are
stored as a sequence of LPC parameters. During
synthesis, the requisite diphone models are
time-warped, abutted and smoothed to produce a
complete sequence of LPC parameters that are used
in the synthesis. The algorithms used are
described and compared with more conventional
methods. Examples of the synthesized speech will
be played.


1.  INTRODUCTION 

We have developed a phonetic speech
synthesizer that is designed to operate as the
synthesis part of a very-low-rate (less than 100
bits per second) speech transmission system [1].
It also has application as part of text-to-speech
and voice response systems.

A  string  of  phonemes  to  be  synthesized  is 
first  translated  into  a  corresponding  diphone 
sequence.  A  diphone  is  defined  as  the  region  from 
the  middle  of  one  phoneme  to  the  middle  of  the  next 
phoneme.  A  diphone  template  consists  of  the  LPC 
parameters  necessary  for  synthesizing  one  diphone. 
The  LPC  parameters,  which  are  assembled  by 
concatenating  diphone  templates,  are  then  used  to 
synthesize  speech. 
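
The translation from a phoneme string to a diphone sequence amounts to pairing each phoneme with its successor. The sketch below is illustrative only; the function name and the use of "-" as a silence symbol at the start of the utterance are assumptions.

```python
def to_diphones(phonemes):
    """Pair each phoneme with the next one: [p0, p1, p2] -> [(p0, p1), (p1, p2)]."""
    return list(zip(phonemes, phonemes[1:]))


# "The m..." written with an assumed leading silence symbol "-".
diphones = to_diphones(["-", "DH", "AX", "M"])
```

A string of N phonemes therefore yields N-1 diphones, each spanning the second half of one phoneme and the first half of the next.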

Dixon and Maxey [2] used diphone segment
assembly in a terminal analog speech synthesis
system. There have been a number of recent efforts
at diphone-LPC synthesis [3,4]. One important goal
of these recent systems was to minimize the storage
and complexity. For instance, the "dyads" used by
Olive were each represented by only two sets of
log-area-ratio (LAR) parameters. In addition,
Olive used simple linear interpolation of LAR
parameters for reconstructing parameter tracks and
for connecting between successive dyads. Olive
also avoided using all permutations of phonemes,
thereby reducing the number of dyads needed.

The overriding goal in our effort is to
produce natural-sounding speech of a quality
comparable to that of unquantized LPC vocoded
speech. Consequently, we have resisted
compromising speech quality by imposing limits on
either storage or complexity. For instance,
whenever a diphone is significantly affected by
context we use a context-dependent diphone. We
have retained, however, the requirement of being
able to specify the input to the synthesizer using
less than 100 bits per second.

The following sections of this paper describe
our diphone synthesis method in greater detail.
Where appropriate, we shall compare our method to
existing methods for diphone/dyad synthesis.

2.  DIPHONE  SYNTHESIS  METHOD 

The diphone is a natural unit for synthesis
because the coarticulatory influence of one phoneme
does not usually extend much further than halfway
into the next phoneme. Since diphone junctures are
usually at articulatory steady states, minimal
smoothing is required between adjacent diphones.
Also  the  difficult  task  of  duplicating  phoneme 
transitions  by  complicated  acoustic-phonetic  rules 
is  avoided.  We  estimate  that  approximately  2000 
diphones  are  needed  to  achieve  high  quality. 

A fundamental design decision in this research
was to use diphone templates that have been
extracted from real speech. We designed and
recorded a set of nonsense utterances that contain
all the possible diphones of English. The nonsense
utterances in which these diphones were embedded
were designed to provide a phonetically neutral
context, in order that the diphone templates be
applicable in a wide range of contexts. We also
recorded those triphones (diphones in context) that
were felt to be necessary. Whenever a diphone
template extracted from a nonsense utterance is
found to be inadequate, we extract a template from
a more appropriate word or phrase.

Each  template  contains  14  LAR  parameters  and 
the  gain  for  each  frame  of  speech.  The  phoneme 
boundary  (as  indicated  by  manual  transcription)  is 
preserved. 

Figure 1 illustrates the synthesis procedure.
The speech synthesis program

1) expands the input phoneme sequence into a
diphone sequence;

2) selects the most appropriate diphone template
(depending on the local phonetic context);

3) time-warps each of the diphone templates to
produce a gain track and 14 LAR parameter tracks
of the specified durations;

4) smooths between consecutive warped diphone
templates to minimize gain and spectral
discontinuities;

5) reconstructs continuous pitch tracks by linear
interpolation of the single pitch values given;

6) determines the cutoff frequency and voicing
using knowledge of the phoneme being
synthesized;

7) uses the resulting sequence of LARs, pitch,
gain, and cutoff frequency (specified every 10
ms) as input to control an LPC speech
synthesizer.

Fig. 1 Diphone synthesis method.
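
Step 5 above can be sketched as a plain linear interpolation at the 10 ms frame interval. The text does not state where within a phoneme its single pitch value applies; the sketch assumes the midpoint and holds the first and last values constant outside the first and last midpoints. The function name and the example durations and pitch values are also assumptions.

```python
def pitch_track(durations_ms, pitches_hz, frame_ms=10.0):
    """Interpolate one pitch value per phoneme into a frame-rate pitch track."""
    # Assumed anchor: each pitch value sits at the midpoint of its phoneme.
    anchors = []
    start = 0.0
    for d in durations_ms:
        anchors.append(start + d / 2.0)
        start += d
    n_frames = int(start // frame_ms)

    track = []
    for i in range(n_frames):
        t = i * frame_ms
        if t <= anchors[0]:
            track.append(pitches_hz[0])
        elif t >= anchors[-1]:
            track.append(pitches_hz[-1])
        else:
            for j in range(1, len(anchors)):
                if t <= anchors[j]:
                    w = (t - anchors[j - 1]) / (anchors[j] - anchors[j - 1])
                    track.append(pitches_hz[j - 1]
                                 + w * (pitches_hz[j] - pitches_hz[j - 1]))
                    break
    return track


track = pitch_track([100.0, 100.0], [100.0, 120.0])
```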

3.  SYNTHESIS  ALGORITHMS 

Steps  1  and  7  given  above  are  relatively 
straightforward  procedures.  This  section  will 
discuss  the  procedures  in  steps  2  through  6  in 
greater  detail. 

Extension of Diphones

In  this  section,  we  describe  some  extensions 
to  the  fundamental  definition  of  a  diphone.  These 
extensions  allow  "special  cases"  to  be  handled 
uniformly  by  the  synthesis  program  (thus  permitting 
the  highest  possible  intelligibility  and 
naturalness).

Context-Specific Diphones. Most diphones can
be used adequately independent of context, but
there are important exceptions. For instance, in
the sequence [W-IH-L], as in the word "will", the
part of the phoneme [IH] contained in the diphone
[W-IH] is drastically affected by the presence of
the [L]. Consequently, there must be a separate
diphone template for [W-IH] to be used in this
context. To account for such phenomena we have
allowed more than one template to be defined for
diphones when they are affected by context.
Additional diphone templates necessitated by
lateralization, retroflexion and other strong
contextual phenomena are expected to account for
10-20% of the total diphone inventory.
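
One simple way to hold several templates for the same diphone is a table keyed by the diphone and an optional context, with a fall-back to the context-independent entry. The sketch below is an assumption about data layout, not a description of the actual program; the function name, the key format, and the placeholder template names are hypothetical.

```python
def select_template(inventory, left, right, following=None):
    """Return the context-specific template if one exists, else the default."""
    key = (left, right, following)
    if key in inventory:
        return inventory[key]
    return inventory[(left, right, None)]


# Placeholder strings stand in for stored LPC parameter sequences.
inventory = {
    ("W", "IH", None): "W-IH (default)",
    ("W", "IH", "L"): "W-IH (before L)",
}

in_will = select_template(inventory, "W", "IH", following="L")
elsewhere = select_template(inventory, "W", "IH", following="T")
```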


Splitting Phonemes. Some phonemes that have
more than one acoustically distinct region have
been split up into two "pseudophonemes". The
diphthongs, for example, have two very different
acoustic regions. For instance, the diphthong
[AY], as in "bite", starts out much like an [AA],
as in "pot", but ends up more like the [IY] in
"beet". The two relatively steady regions are
connected by a rapid transition between them. The
durations of the two steady regions are somewhat
independent, as are the contextual effects of
neighboring phonemes. Therefore, some of the
diphthongs have been split into two pseudophonemes
(e.g., [AY1,AY2]), which appear only in sequence.
The unvoiced plosives and affricates also have two
acoustically distinct regions. Each region is
treated as if it were a separate phoneme.
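
The expansion into pseudophonemes can be sketched as a table lookup applied before the diphone sequence is formed. Only the [AY1,AY2] pair is named in the text; the table layout and the function name are assumptions, and any further entries would have to come from the actual inventory.

```python
# Only the AY entry is taken from the text; the table is otherwise hypothetical.
PSEUDOPHONEMES = {"AY": ["AY1", "AY2"]}


def split_phonemes(phonemes, table=PSEUDOPHONEMES):
    """Replace each phoneme that has two acoustic regions by its two parts."""
    out = []
    for p in phonemes:
        out.extend(table.get(p, [p]))
    return out


expanded = split_phonemes(["B", "AY", "T"])   # the word "bite"
```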

Both the context-dependent diphone templates
and the pseudophonemes described above are
determinable from a normal phoneme string. The
only change in data rate is due to the extra
pseudophoneme duration (about 2 bits).

Time  Warping 

In order to provide input to the LPC synthesis
program we must specify LPC coefficients at fixed
intervals (10 ms) by time warping template
information (whose duration is fixed) to satisfy
the phoneme durations specified by the input. This
is made difficult by the fact that the time warping
must preserve the naturalness of speech. One way
to do this is to treat speech as being made up of
elastic and inelastic regions.

The principle of distinguishing between
"elastic" and "inelastic" regions of the template
arises from observations of speech parameters under
widely varying speaking rates. Most of the
durational variation is observed to occur during
the "steady state" portion of the phoneme (an
elastic region), whereas the transition portions
(inelastic regions) are relatively insensitive to
changes in speaking rate. Hence, our time warping
algorithm allows us to specify that a certain
percentage of the diphone template on each side of
the phoneme boundary is to be treated as relatively
inelastic and the rest of the diphone template (the
section corresponding to phoneme middles) as more
elastic.

Time warping is accomplished by the use of
piecewise linear mapping functions, which define
that part of the template to be used at each
instant of time. This correspondence and the
resulting mapping function are illustrated in
Figure 2. The speech being synthesized is the
sequence /- DH AX M/ as in "The man...". The
vertical axis represents time in the diphone
templates. The horizontal axis represents time in
the synthesized speech. The diphone template
durations are determined from the original recorded
nonsense utterances, while the phoneme durations
are determined by the input utterance to be
vocoded.

The phoneme boundaries in the diphone
templates are mapped onto the phoneme boundaries in
the synthesized speech to uniquely define a point
in the piecewise linear mapping function. The
diphone boundary is mapped onto the center of the
phonemes. The DASHED lines connect phoneme
boundaries, and the DASH-DOT lines connect diphone
boundaries (phoneme centers). Notice that the
templates are generally compressed during the
mapping. The reason for this is that (whenever
possible) our diphone templates are extracted from
fully articulated short utterances. Furthermore,
we believe that compressing a long template will
result in better speech quality than expanding a
short one, since information is more easily ignored
than generated.

Fig. 2 Piecewise linear diphone mapping function.

The "elastic" and "inelastic" regions are
delineated by small tic marks on both axes. The
knees in the mapping function shown correspond to
the intersection of these tic marks. Although both
elastic and inelastic regions of the template (in
this example) are shortened by mapping from
templates to phonemes, the inelastic regions are
shortened less.
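A minimal sketch of such a piecewise linear mapping is given below for one half of a diphone template, running from the phoneme boundary to the phoneme center. The text does not state how the duration change is divided between the two regions, so the rule used here (the inelastic region absorbs the change with a reduced weight `stiffness`) and the parameter values `inelastic_frac=0.3` and `stiffness=0.25` are assumptions chosen only to reproduce the qualitative behavior described above.

```python
from typing import List, Tuple

Knot = Tuple[float, float]  # (time in synthesized speech, time in template), ms


def warp_knots(template_ms: float, output_ms: float,
               inelastic_frac: float = 0.3, stiffness: float = 0.25) -> List[Knot]:
    """Knots of the mapping for one half-template.

    Time 0 is the phoneme boundary; the inelastic region comes first,
    followed by the elastic region up to the phoneme center.
    """
    inelastic = inelastic_frac * template_ms
    elastic = template_ms - inelastic
    change = template_ms - output_ms              # > 0 means compression
    weight = stiffness * inelastic + elastic
    inelastic_out = inelastic - change * stiffness * inelastic / weight
    elastic_out = elastic - change * elastic / weight
    if inelastic_out <= 0 or elastic_out <= 0:
        raise ValueError("requested duration is too short for this template")
    return [(0.0, 0.0), (inelastic_out, inelastic),
            (float(output_ms), float(template_ms))]


def map_time(knots: List[Knot], t: float) -> float:
    """Template time used at synthesized-speech time t."""
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x0 <= t <= x1:
            return y0 + (y1 - y0) * (t - x0) / (x1 - x0)
    raise ValueError("t lies outside the mapped interval")


# Compress a 100 ms half-template into 60 ms, sampled every 10 ms frame.
knots = warp_knots(100, 60)
template_times = [map_time(knots, t) for t in range(0, 61, 10)]
print(knots[1], template_times)
```

The knee at `knots[1]` plays the role of the tic-mark intersection in Figure 2: the slope before it (inelastic region) is closer to 1 than the slope after it, so the transition is shortened less than the steady-state portion.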

As  part  of  our  test  of  the  warping  algorithm, 
we  tried  varying  the  rate  of  the  synthesized 
speech.  When  the  time  warping  was  uniform,  some 
phoneme  transitions  (e.g.,  vowel/nasal  transitions) 
became  less  intelligible.  The  time  warping 
relations  were  varied  such  that  all  the  transitions 
sounded  natural  over  a  wide  range  of  speech  rates. 

Parameter Continuity

When templates are concatenated,
discontinuities in the gain and the LAR parameters
may result despite the fact that the parameters are
smooth within each template. In order to deal with
this, we define an "interpolation region" that
straddles the diphone template boundary (phoneme
middle), which contains potential parametric
discontinuities. This interpolation region is
defined by two "interpolation points" (one from
each diphone).


Figure 3 compares two different parameter
smoothing algorithms as applied to a single
parameter. The heavy line indicates the parameter
tracks as taken from the two diphone templates.
Taken together these tracks span one phoneme.
There is a discontinuity at the diphone boundary,
which is indicated by the vertical DASHED line.
The two vertical DASH-DOT lines delineate the
interpolation region. The straight line (a)
connecting the two parameter tracks illustrates the
effect of linear interpolation. As can be seen, in
this case the linear interpolation results in a
poor fit to the original data. The other curve (b)
connecting the two points is derived by adding a
ramp (shown at the bottom of the figure) to each
parameter track, such that the discontinuity is
eliminated. This second smoothing method often
yields a better fit to the original parameter
tracks than linear interpolation. Note that the
second method requires that the parameter tracks be
represented in the diphone templates by more than
two frames per diphone. It is expected that 4 to 6
frames per diphone will be sufficient.

Fig. 3 Parameter Smoothing. (a) Linear
interpolation, (b) Adding ramps with equal
slope.
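The two smoothing methods of Figure 3 can be sketched as follows for a single parameter. The sketch assumes that the last frame of the left track and the first frame of the right track are two estimates of the same instant (the diphone boundary), and that `a` and `b` are the indices of the interpolation points in the left and right tracks. Reading "equal slope" as one common slope for both ramps is an interpretation, not something the text spells out.

```python
from typing import List


def smooth_linear(left: List[float], right: List[float], a: int, b: int) -> List[float]:
    """Method (a): replace the interpolation region by a straight line."""
    span = (len(left) - 1 - a) + b            # frame intervals in the region
    if span <= 0:
        raise ValueError("interpolation region is empty")
    start, end = left[a], right[b]
    middle = [start + (end - start) * k / span for k in range(span + 1)]
    return left[:a] + middle + right[b + 1:]


def smooth_ramps(left: List[float], right: List[float], a: int, b: int) -> List[float]:
    """Method (b): add equal-slope ramps so the tracks meet at the boundary."""
    span = (len(left) - 1 - a) + b
    if span <= 0:
        raise ValueError("interpolation region is empty")
    slope = (right[0] - left[-1]) / span      # jump spread over the region
    left_fixed = [v + slope * max(0, i - a) for i, v in enumerate(left)]
    right_fixed = [v - slope * max(0, b - j) for j, v in enumerate(right)]
    # Both corrected tracks now agree at the boundary; keep one copy of it.
    return left_fixed + right_fixed[1:]


left = [0.0, 4.0, 6.0, 7.0]    # rising track ending at the boundary
right = [9.0, 8.0, 6.0, 2.0]   # falling track starting at the boundary
print(smooth_linear(left, right, a=1, b=2))
print(smooth_ramps(left, right, a=1, b=2))
```

In this made-up example the tracks form a peak near the boundary. Linear interpolation flattens the peak, while the ramp method keeps the shape of each track and removes only the jump, which is why it needs more than two frames per diphone to work from.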


In addition to 14 LAR parameters and gain, the
LPC synthesis program requires for each 10 ms frame
a value of pitch, a voicing flag, and a cutoff
frequency.

Pitch Track. The pitch track during voiced
phonemes is reconstructed from the input pitch
values by straight line interpolation. In our
simulation of the analysis part of a phonetic
vocoder, the single transmitted pitch values are
determined from complete pitch tracks in the
sentence being analyzed. The analysis program
(given the phoneme identities and phoneme
boundaries) determines a weighted piecewise linear
least squares fit to that pitch track. The
endpoints of the linear sections (which occur at
phoneme boundaries) are transmitted. The weighting
is designed to minimize the effect of pitch tracker
errors. It was found that sentences synthesized
using these piecewise linear pitch tracks are
practically indistinguishable from those using the
original analyzed pitch tracks.
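The synthesis side of this scheme, reconstructing a frame-by-frame pitch track from the pitch values transmitted at phoneme boundaries, can be sketched as below. The function name, the millisecond and hertz units of the arguments, and the example values are assumptions for illustration; the weighted least squares fit performed by the analysis program is not shown.

```python
from typing import List


def reconstruct_pitch(boundaries_ms: List[float], pitch_hz: List[float],
                      frame_ms: float = 10.0) -> List[float]:
    """Straight-line interpolation of pitch between phoneme boundaries.

    boundaries_ms must be strictly increasing, with one pitch value
    per boundary. One output value is produced per synthesis frame.
    """
    if len(boundaries_ms) != len(pitch_hz) or len(boundaries_ms) < 2:
        raise ValueError("need one pitch value per boundary, at least two")
    track: List[float] = []
    seg = 0
    t = boundaries_ms[0]
    while t <= boundaries_ms[-1]:
        # Advance to the linear section that contains time t.
        while seg < len(boundaries_ms) - 2 and t > boundaries_ms[seg + 1]:
            seg += 1
        t0, t1 = boundaries_ms[seg], boundaries_ms[seg + 1]
        p0, p1 = pitch_hz[seg], pitch_hz[seg + 1]
        track.append(p0 + (p1 - p0) * (t - t0) / (t1 - t0))
        t += frame_ms
    return track


# Two voiced phonemes: 0-40 ms and 40-60 ms.
print(reconstruct_pitch([0, 40, 60], [100, 120, 110]))
```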




Voicing. Voicing was determined directly from
the identity of the phoneme being synthesized.
Voicing errors are avoided by careful placement of
phoneme boundaries in the diphone templates. We
have found that a one frame error in placement of
the boundary can cause a severe "pop" in the
synthesized speech, due to misalignment of spectral
parameters with excitation parameters.

Mixed-Source Model - Cutoff Frequency.
Previous work [5,6] has shown that our mixed-source
model of excitation results in more natural
sounding (less buzzy) speech by allowing for
specification of a cutoff frequency with every
value of pitch. The voicing excitation is low-pass
filtered and the frication is high-pass filtered.
The cutoff frequency simultaneously marks the upper
edge of the voicing spectrum and the lower edge of
the frication spectrum. Currently we have
implemented an algorithm that selects a cutoff
frequency based on distinctive features of the
phoneme being synthesized. For example, for vowels
the cutoff frequency is at 5000 Hz (fully voiced);
for unvoiced consonants it is 0 Hz; and for voiced
fricatives (which are produced with both periodic
and random excitation) it is 1500 Hz. The cutoff
frequency parameter track is then low-pass filtered
in time in order to minimize excitation
discontinuities at phoneme boundaries. The
implementation of the cutoff-frequency algorithm
has resulted in a marked improvement in speech
quality: in particular, a decrease in buzziness.
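The cutoff-frequency selection and its smoothing in time can be sketched as follows. The three cutoff values are those quoted above; the phoneme-to-class table is a small hypothetical excerpt, and the low-pass filter is assumed to be a 3-frame moving average because the text does not specify the filter.

```python
from typing import Dict, List

# Cutoff frequency per phoneme class, as quoted in the text.
CUTOFF_HZ: Dict[str, float] = {
    "vowel": 5000.0,
    "unvoiced_consonant": 0.0,
    "voiced_fricative": 1500.0,
}

# Hypothetical excerpt of a phoneme classification table.
PHONEME_CLASS: Dict[str, str] = {
    "AA": "vowel", "IY": "vowel",
    "S": "unvoiced_consonant", "T": "unvoiced_consonant",
    "Z": "voiced_fricative", "V": "voiced_fricative",
}


def cutoff_track(frame_phonemes: List[str]) -> List[float]:
    """Raw cutoff frequency for each frame, from its phoneme label."""
    return [CUTOFF_HZ[PHONEME_CLASS[p]] for p in frame_phonemes]


def lowpass_in_time(track: List[float], width: int = 3) -> List[float]:
    """Moving average over `width` frames (assumed filter); window shrinks at the ends."""
    half = width // 2
    smoothed = []
    for i in range(len(track)):
        window = track[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


raw = cutoff_track(["S", "S", "AA", "AA"])
print(raw)
print(lowpass_in_time(raw))
```

The smoothing turns the abrupt step from 0 Hz to 5000 Hz at the [S]/[AA] boundary into a gradual rise over neighboring frames.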

4. RESULTS

Several sentences have been synthesized using
the techniques described above. The resulting
speech was highly intelligible. There are
occasional instances where the intensity of a vowel
is inappropriate. This is to be expected since
vowel intensity varies with stress, which is not
given as an input to the synthesizer. To alleviate
this problem, a gain adjustment could be estimated
from knowledge of the phoneme, its duration, and
associated pitch value (in relation to those data
for neighboring phonemes). Alternatively, in a
speech transmission system, the analyzer could send
a phoneme-specific gain adjustment for each vowel
using a small number of bits (probably one).
(Since there are only about 5 vowels per second in
normal speech, this represents a small increase in
bit rate.) In a text-to-speech or
synthesis-by-rule application, this gain adjustment
is derived from syllable stress information.

The quality of unquantized LPC vocoded speech
was judged to be somewhat better than that of
synthesized speech; however, the differences were
not large. Obviously the quality of the
synthesized speech cannot be any better than that
of vocoded speech. Nonetheless, our aim is to
achieve that quality.

We  tried  synthesizing  the  speech  of  a  second 
(male)  speaker  by  using  the  phoneme  duration  and 
pitch  values  from  the  new  speaker  in  conjunction 
with  the  diphone  templates  extracted  from  the 
speech  of  our  model  speaker.  The  synthesized 
speech  had  definite  characteristics  of  both 
speakers.  However,  the  characteristics  of  the  new 
speaker  (as  conveyed  in  the  pitch  and  duration 
values)  were  dominant. 


5.  DATA  REDUCTION 

During this initial phase of the research, we
have preserved all of the template data without
frame rate reduction or parameter quantization. We
anticipate that we shall be able to achieve a data
reduction comparable to that used in normal LPC
vocoding. Allowing for a frame rate of 4 to 6
frames per diphone (twice the frame rate of
variable rate LPC vocoding), we would project a
total storage requirement of less than 400,000
bits. While this may be 2 or 3 times that required
by simpler methods, we feel that, for many
applications, the improved naturalness will be
worth the added storage.

6.  CONCLUSION 

We have described a phonetic speech synthesis
program that uses diphone template concatenation
and LPC synthesis to produce highly intelligible
speech of quality comparable to that of LPC vocoded
speech. Every effort has been made to preserve
the necessary information and to use algorithms
sufficiently complex to maximize the naturalness of
the synthesized speech.